What's the best storage mechanism (in terms of the database to use and the layout for storing all the records) for a system built to track whois record changes? The program will be run once a day, and it should keep track of both the previous value and the new value.
I'm looking for suggestions on a database and thoughts on how to store the different records/fields so that data is not redundant/duplicated.
(Added) My thoughts on one mechanism to store data
Example case showing the sale of the domain "sample.com" from PersonA to PersonB on 1/1/2010
Table_DomainNames
DomainId | DomainName
1 | example.com
2 | sample.com
Table_ChangeTrack
DomainId | DateTime | RegistrarId | RegistrantId | (others)
2 | 1/1/2009 | 1 | 1
2 | 1/1/2010 | 2 | 2
Table_Registrars
RegistrarId | RegistrarName
1 | GoDaddy
2 | 1&1
Table_Registrants
RegistrantId | RegistrantName
1 | PersonA
2 | PersonB
All tables are "append-only". Does this model make sense? Table_ChangeTrack should only gain a row when there is a change in ANY of the monitored fields.
Is there any way of making this more efficient / tighter from a size point of view?
The primary data is the existence or changes to the whois records. This suggests that your primary table be:
<id, domain, effective_date, detail_id>
where the detail_id points to actual whois data, likely normalized itself:
<detail_id, registrar_id, admin_id, tech_id, ...>
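A minimal DDL sketch of those two tables (MySQL syntax; everything beyond the columns listed above is illustrative):

-- One row per observed change, pointing at the normalized whois detail.
CREATE TABLE whois_change (
    id             INT AUTO_INCREMENT PRIMARY KEY,
    domain         VARCHAR(255) NOT NULL,
    effective_date DATE NOT NULL,
    detail_id      INT NOT NULL
);

CREATE TABLE whois_detail (
    detail_id    INT AUTO_INCREMENT PRIMARY KEY,
    registrar_id INT NOT NULL,
    admin_id     INT,
    tech_id      INT
    -- further normalized contact/nameserver columns as needed
);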
But do note that most registrars consider the information their property (whether it is or not) and have warnings like:
TERMS OF USE: You are not authorized to access or query our Whois database through the use of electronic processes that are high-volume and automated except as reasonably necessary to register domain names or modify existing registrations...
From which you can expect that they'll cut you off if you read their databases too much.
You could
store the checksum of a normalized form of the whois record data fields for comparison (a sketch of this follows below).
store the original and current version of the data (possibly in compressed form), if required.
store diffs of each detected change (possibly in compressed form), if required.
It is much like how incremental backup systems work. Maybe you can get further inspiration from there.
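A sketch of the checksum option, assuming the whois_change table from the earlier sketch and MySQL's MD5() over a normalized concatenation of the record fields:

-- Keep a checksum of the normalized whois fields on each history row.
ALTER TABLE whois_change ADD COLUMN record_checksum CHAR(32) NOT NULL;

-- The daily job computes MD5 over the freshly fetched, normalized fields,
-- compares it to the latest stored checksum, and inserts a new row only on a mismatch.
SELECT record_checksum
FROM whois_change
WHERE domain = 'sample.com'
ORDER BY effective_date DESC
LIMIT 1;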
You can write VBScript in an Excel file to go out and query a web page (in this case, the whois URL for a specific site) and then store the results back to a worksheet in Excel.
I'm currently searching for a good approach to filter DB results based on permissions which are stored in another service's DB.
Let me first show the current state:
There's one Document-Service with 2 tables (permission, document) in its MySQL DB. When documents for a user are requested, a paginated result should be returned. For brevity let's ignore the pagination for now.
Permission table:
user_id | document_id
--------|------------
1       | A
2       | A
2       | B
2       | C

Document table:
document_id | more columns
------------|-------------
A           | ...
B           | ...
C           | ...
The following request "GET /documents/{userId}" will result in the following query against the DB:
SELECT d.* FROM document d JOIN permission p ON p.document_id = d.document_id WHERE p.user_id = '{userId}';
That's the current implementation, and now I am asked to move the permission table into its own service. I know, one would say that's not a good idea, but this question is just a broken-down example, and in the real scenario it's a more meaningful change than it looks like. So let's take it as a "must-do".
Now my problem: after I move the table into another DB, I cannot use it in the SQL query of the Document-Service anymore to filter results.
I also cannot query everything and filter in code, because there will be too much data AND i must use pagination which is currently implemented by LIMIT/OFFSET in the query (ignored in this example for brevity).
I am not allowed to access a DB from any other application except its service.
My question is: Is there any best practice or suggested approach for this kind of situation?
I already had two ideas which I would like to list here, even though I'm not really happy with either of them:
Query all document_ids of a user from the new Permission-Service and change the SQL to "SELECT * FROM document WHERE document_id IN {doc_id_array_from_permission_service}". The array could get pretty big and the statement slow; not happy about that.
Replicate the permission table into the Document-Service DB on startup and keep the query as it is. But then I need to implement logic/an endpoint to update the table in the Document-Service whenever it changes in the Permission-Service, otherwise it gets out of sync. This feels like I'm duplicating so much logic in both services.
For the sake of this answer, I'm going to assume that it is logical for Permissions to exist completely independently of Documents. That is to say - if the ONLY place a Permission is relevant is with respect to a DocumentID, it probably does not make sense to split them up.
That being the case, either of the two options you laid out could work okay; both have their caveats.
Option 1: Request Documents with ID Array
This could work, and in your simplified example you could handle pagination prior to making the request to the Documents service. But, this requires a coordinating service (or an API gateway) that understands the logic of the intended actions here. It's doable, but it's not terribly portable and might be tough to make performant. It also leaves you the challenge of now maintaining a full, current list of DocumentIDs in your Permissions service which feels upside-down. Not to mention the fact that if you have Permissions related to other entities, those all have to get propagated as well. Suddenly your Permissions service is dabbling in lots of areas not specifically related to permissions.
Option 2: Eventual Consistency
This is the approach I would take. Are you using a Messaging Plane in your Microservices architecture? If so, this is where it shines! If not, you should look into it.
So, the way this would work is any time you make a change to Permissions, your Permissions Service generates a permissionUpdatedForDocument event containing the relevant new/changed Permissions info. Your Documents service (and any other service that cares about permissions) subscribes to these events and stores its own local copy of relevant information. This lets you keep your join, pagination, and well-bounded functionality within the Documents service.
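As a sketch (table and column names are hypothetical), the Documents service could hold its own event-maintained copy of the permissions and keep the join and pagination local:

-- Local copy inside the Document service's own DB, updated from permission events.
CREATE TABLE document_permission (
    user_id     BIGINT      NOT NULL,
    document_id VARCHAR(36) NOT NULL,
    PRIMARY KEY (user_id, document_id)
);

-- GET /documents/{userId} stays a single local query, pagination included:
SELECT d.*
FROM document d
JOIN document_permission p ON p.document_id = d.document_id
WHERE p.user_id = 2
ORDER BY d.document_id
LIMIT 20 OFFSET 0;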
There are still some challenges. I'd try to keep your Permissions service away from holding a list of all the DocumentID values. That may or may not be possible. Are your permissions Role or Group-based? Or, are they document-specific? What Permissions does the Documents service understand?
If permissions are indeed tied explicitly to individual documents, and especially if there are different levels of permission (instead of just binary yes/no access), then you'll have to rethink the structure in your Permissions service a bit. Something like this:
Permission table:
user_id| entity_type| entity_id | permission_type
-------|------------|-----------|----------------
1 | document | A | rwcd
2 | document | A | r
2 | document | B | rw
2 | document | C | rw
1 | other | R | rw
Then, you'll need to publish serviceXPermissionUpdate events from any Service that understands permissions for its entities whenever those permissions change. Your Permissions service will subscribe to those and update its own data. When it does, it will generate its own event and your Documents service will see confirmation that its change has been processed and accepted.
This sounds like a lot of complication, but it's easy to implement, performant, and does a nice job of keeping each service pretty well contained. The Messaging plane is where they interact with each other, and only via well-defined contracts (message names, formats, etc.).
Good luck!
I have a database that I am trying to set up and I would like it to be in at least 3NF. However, some fields are not necessary in all situations, and whether a given field is needed at all (not its value) depends on another field.
In essence, I want to keep track of jobs that are on hold for one reason or another.
My main table right now includes these fields:
Job No (primary Key) | Short Text | Storage Location | Coordinator
I have other tables for employee list and storage locations. Now my problem is if the job is in the storage location "LAB," then it will have an associated Lab Ticket number that I want to track. I will have another table of Lab Tickets that contains status, ECD, etc. If the storage location is "MR" then the job should have a Notification number, and a separate table will contain info about the Notifications.
Although a job can only have 1 storage location at any given time, it can move. For instance, if a job goes to "LAB" and fails the test, it will get moved to "MR" and have a Notification created.
Is it a violation of 3NF, or otherwise just bad form, to have my tblJobs have fields:
Job No (primary Key) | Short Text | Storage Location | Coordinator | Lab Ticket | Notification | ...
even if not all fields are populated or used for every job? BTW I'm using MS Access, though I don't think that matters.
Edit: I see the related posts about Null values, but my question is less about the programming (I can easily enter a non-null value [e.g. "N/A"] in the not-applicable fields) and more about the abstract database design level: in short, is it bad form to have fields that may not apply to the majority of records? I normally hate seeing a bunch of N/A fields in any table, but I'm starting to think some well thought-out queries will allow me to see only the relevant information for a specific subset, e.g. for all items in "LAB", show the lab ticket status.
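For example, that last query could look like this sketch in Access SQL, assuming a hypothetical tblLabTickets table keyed by the lab ticket number and holding the status/ECD fields mentioned above:

-- All jobs currently in LAB, with their lab ticket status.
SELECT j.[Job No], j.Coordinator, lt.[Lab Ticket], lt.Status, lt.ECD
FROM tblJobs AS j
INNER JOIN tblLabTickets AS lt ON lt.[Lab Ticket] = j.[Lab Ticket]
WHERE j.[Storage Location] = 'LAB';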
I've developed an iPhone app that allows users to send/receive files to/from a server.
At this point I wish to add a database to my server side application (sockets in c#) in order to keep track of the following:
personal card (name, age, email, etc.) - a user can (but isn't obligated to) fill one out
the number of files a user sent and received so far
app stats, which are sent every once in a while and contain info such as the number of errors in the app, the OS version, etc.
the number of total files sent/received in the past hour (for all users)
each user has a unique 36-digit hex ID, e.g. "AF41-FB11-.....-FFFF"
The DB should provide the following answers: which users are "heavy users", how many files were exchanged in the past day/hour/month and is there a correlation between OS and number of errors.
Unfortunately I'm not very familiar with DB design, so I thought about the following:
a users table which will contain:
uniqueUserID | Personal Card | Files Sent | Files Received
an App stats table (each user can have many records)
uniqueUserID | number_of_errors | OS_version | .... | submission_date_time
a general stats table (new record added every hour)
total_files_received_in_the_last_hour | total_files_sent_in_the_last_hour | submission_date_time
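A rough DDL sketch of those three tables (MySQL/InnoDB; column names partly illustrative):

CREATE TABLE users (
    user_uuid      CHAR(36) PRIMARY KEY,   -- the unique 36-digit hex id
    name           VARCHAR(100),
    age            INT,
    email          VARCHAR(255),
    files_sent     INT NOT NULL DEFAULT 0,
    files_received INT NOT NULL DEFAULT 0
) ENGINE=InnoDB;

CREATE TABLE app_stats (
    id                   INT AUTO_INCREMENT PRIMARY KEY,
    user_uuid            CHAR(36) NOT NULL,
    number_of_errors     INT,
    os_version           VARCHAR(32),
    submission_date_time DATETIME NOT NULL,
    FOREIGN KEY (user_uuid) REFERENCES users (user_uuid)
) ENGINE=InnoDB;

CREATE TABLE general_stats (
    id                                    INT AUTO_INCREMENT PRIMARY KEY,
    total_files_received_in_the_last_hour INT NOT NULL,
    total_files_sent_in_the_last_hour     INT NOT NULL,
    submission_date_time                  DATETIME NOT NULL
) ENGINE=InnoDB;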
My questions are:
Performance-wise, does it make sense to collect and store data per user inside the server-side application and flush it all to the DB once an hour (e.g. open a connection, UPDATE/INSERT fields, close the connection)? Or should I simply update the DB for every transaction (send/receive file) a user performs, at the time he performs it? (A small sketch follows below.)
Should I create a different primary key, other than the 36-digit id?
Does this design make sense?
I'm using mySQL 5.1, innoDB, the DBMS is on the same machine as the server-side app
Any insights will be helpful!
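For the first question, either approach comes down to very small statements against the users table; a hedged sketch of the per-transaction variant (the hourly batch would run the same UPDATE once with the accumulated delta instead of 1):

-- Run per send event, or once an hour with the in-memory delta instead of 1.
UPDATE users
SET files_sent = files_sent + 1
WHERE user_uuid = 'AF41-FB11-.....-FFFF';   -- placeholder id from the question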
I will make my question very simple.
I have a ruby on rails app, backed with mysql.
I click a button on page 1; it goes to page 2 and lists a table of 10 company names.
This list of ten companies is randomly generated (based on the logic behind that clicked button on page 1) from the COMPANIES table, which has 10k company names.
How do I calculate the number of times a company name is displayed on page 2 in a day?
Example: at the end of day 1
COMPANY_NAME | COUNT
A | 2300
B | 100
C | 500
D | 10000
Now, from the research I have done, raw inserts will be costly, and I learned there are two common ways to do it:
Open a Unix file and write into it. At the end of the day, INSERT the contents into the database.
Negative: if the file is accessed concurrently, it will lead to lock issues.
Memcache the count and bulk insert into the DB.
What is the best way to do it in rails?
Are there any other ways to do this?
Use Redis. Redis is good for maintaining in-memory data. It has atomic increments (and other data structures).
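Whichever in-memory store holds the counters during the day (Redis or memcached), the end-of-day write can then be a single bulk upsert; a sketch with illustrative table and column names, reusing the example numbers from the question:

CREATE TABLE daily_company_counts (
    display_date  DATE NOT NULL,
    company_id    INT  NOT NULL,
    display_count INT  NOT NULL DEFAULT 0,
    PRIMARY KEY (display_date, company_id)
);

-- One statement at the end of the day, values fed from the in-memory counters;
-- the ON DUPLICATE KEY clause accumulates if the flush runs more than once.
INSERT INTO daily_company_counts (display_date, company_id, display_count)
VALUES (CURDATE(), 1, 2300), (CURDATE(), 2, 100), (CURDATE(), 3, 500)
ON DUPLICATE KEY UPDATE display_count = display_count + VALUES(display_count);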
I have the following data which I want to save in my DB (this is used for sending text messages via a 3rd party API)
text_id, text_message, text_time, (array)text_contacts
text_contacts contains a normal array with all the contact_id's
How should I properly store the data in a MySQL database?
I was thinking of two ways myself:
Store the contact_ids as a json_encoded string (no need for serializing since it's not multi-dimensional) in a text field in the DB.
Make a second table keyed by text_id, with each contact_id on its own row.
note: The data stored in the text_contacts array does not need to be changed at any time.
note 2: The data is used per individual contact_id to get the contact's phone number and to check whether the text message has actually been sent (with a combination of text_id and phone number).
What is more efficient, and why?
This is completely dependent upon your expected usage characteristics. If you will have a near-term need to query based upon the contact_ids, then store them independently as in your second solution. If you're storing them for archival purposes, and don't expect them to be used dynamically, you're as well off saving the time and storing them in a JSON string. It's all about the usage.
IMO, go with the second table, mapping text_ids to contact_ids. It will be easier to manipulate than storing all the contacts in one field.
This topic will bring in quite a few opinions, but my belief: second table, by all means (a sketch follows the list below).
If you ever have a case where you actually need to search by that data, it will not require you to parse it before using it.
It is a heck of a lot easier to debug (for the same reason)
json_encode and json_decode (or equivalent) take far more time than a join does.
Lazy loading is easier, even if not necessary in most cases.
Others will find it more readable and, with a good schema definition, easier to conceptualize and maintain.
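A sketch of that second-table layout and the lookup it enables (names are illustrative; it assumes a contacts table that already holds the phone numbers):

-- One row per (text message, contact) pair.
CREATE TABLE text_message_contacts (
    text_id    INT NOT NULL,
    contact_id INT NOT NULL,
    PRIMARY KEY (text_id, contact_id)
);

-- All contacts a given text message was addressed to, with their phone numbers.
SELECT c.contact_id, c.phone_number
FROM text_message_contacts tmc
JOIN contacts c ON c.contact_id = tmc.contact_id
WHERE tmc.text_id = 1;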
Almost all implementations would use one table to store the contacts, and a second table that uses a foreign key to reference the contacts table. So, say you had a text_contacts table that looked like this:
contact_id | name
1 | someone
2 | someone_else
And a text message table that looked like this:
text_id | text_message | text_time | text_contact
1 | "Hey" | 12:48 | 1
2 | "Hey" | 12:48 | 2
Each contact that has been sent a message gets a new entry in the text message table, with the last column referencing the contact_id field of the text_contacts table. This makes it much easier to retrieve messages by contact, because you can say "select * from text_messages where text_contact = 1" instead of searching through each of the arrays in the single-table design to find the messages sent to a specific contact.