I need to collect statistics from my server application written in Python. I am looking for some general guidance on how to set up the models and exactly how to store the statistics information. I was thinking of storing and organizing all this information in a database, but my implementation is turning out to be too specific.
I need to collect stats like active users, requests processed and things like that over time.
Are there any guides or techniques out there to create some more generic statistics storage systems?
As with most software problems, there is no single solution I can recommend that will solve yours. But I have created a few similar programs, and here are some things I found worked well.
Create an asynchronous logging service so the logging doesn't adversely affect your code's performance. (You need to be mindful of where you are storing your data, where it is processed, etc. because you can still significantly degrade performance if you're not careful.) I have found that creating a web service is often convenient.
Try to save as much information about the request as possible. In the future this will make it easier to add new queries and reports.
Normalize your data
Always include the time the action was performed. If you can capture run time, that is typically useful too.
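As an illustration of the asynchronous logging idea from the first point, here is a minimal sketch using only Python's standard library QueueHandler/QueueListener, so the application thread never blocks on a slow handler. The logger name, file name, and log fields are made up for illustration.

```python
# Minimal sketch of asynchronous logging using only the standard library.
# The application thread only puts records on a queue; a background
# listener thread does the (potentially slow) writing.
import logging
import logging.handlers
import queue

log_queue = queue.Queue(-1)                      # unbounded queue
queue_handler = logging.handlers.QueueHandler(log_queue)

stats_logger = logging.getLogger("stats")        # hypothetical logger name
stats_logger.addHandler(queue_handler)
stats_logger.setLevel(logging.INFO)

# The real handler (file, socket, HTTP, ...) runs in the listener thread.
file_handler = logging.FileHandler("stats.log")
listener = logging.handlers.QueueListener(log_queue, file_handler)
listener.start()

stats_logger.info("request processed user=%s runtime_ms=%d", "alice", 42)
# ... on shutdown:
listener.stop()
```

The same idea scales up to a separate logging web service; the point is simply that the code serving requests only enqueues, and something else does the storing.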
One approach is to do this in stages: store activity logs, including requests and users, as text files. Later, mine the logs into data points (Python should be able to do this easily). You may want to use Python's logging library for the logging stage. In general, start with high time-resolution logging, which you can later aggregate into hourly, daily, weekly summaries, etc.
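To make the staged approach concrete, here is a rough sketch of the later "mining" step. It assumes each log line starts with an ISO timestamp followed by an event name; the line format and field names are invented for illustration.

```python
# Rough sketch: aggregate high-resolution log lines into hourly counts.
# Assumes each line looks like: "2024-01-31T12:34:56 request_processed user=42"
from collections import Counter
from datetime import datetime

def hourly_counts(log_path):
    counts = Counter()
    with open(log_path) as f:
        for line in f:
            timestamp, event = line.split()[:2]
            hour = datetime.fromisoformat(timestamp).strftime("%Y-%m-%d %H:00")
            counts[(hour, event)] += 1
    return counts

# e.g. {("2024-01-31 12:00", "request_processed"): 1382, ...}
```

The hourly counts can then be written to whatever statistics store you settle on, and rolled up further into daily or weekly summaries.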
Something I have searched for but cannot find a straight answer to is this:
For a given service, if there are two instances of that service deployed to two machines, do they share the same persistent store or do they have separate stores with some syncing mechanism (master/slave, clustering)?
E.g. I have an OrderService backed by MySQL. We're getting many orders in, so I need to scale this service up and we deploy a second OrderService. Where does its data come from?
It may sound silly but, to me, every discussion makes it seem like the service and database are a packaged unit that are deployed together. But few discussions mention what happens when you deploy a second service.
Posting this as an answer because it's too long for a comment.
Microservices are self-contained components and as such are responsible for their own data. If you want to get at the data you have to talk to the service API. This applies mainly to different kinds of services (i.e. you don't share a database among services that offer different kinds of business functionality - that's bad practice because you couple services at the hip through the database, and it then becomes easy to do through the database things that would normally be done at the API level, simply because it's more convenient => you risk losing componentization).
But if you have the same kind of service, then there are, as you mentioned, two obvious choices: share a database or have each service contain its own database.
Now you have to ask yourself which solution to choose:
Are these OrderServices of yours truly capable of working on their own, or do you need to have all the orders in the same database for reporting or access by other applications?
Determine what your actual bottleneck is. Is it the database? If not, then share the database. Is it the services? If not, then distribute your data.
Do you need to distribute the data? What are your choices and what are your needs? Do you need to be consistent all the time, or is eventual consistency good enough? Do you need separate databases that you synchronize manually, or does your database installation handle replication and partitioning out of the box?
etc
What I'm trying to say is that in this kind of situation the answer is: it depends. And something that we tech geeks often forget to do before embarking on such distributed/scalability/architecture journeys is to talk to the business. Often the business can handle a certain degree of inconsistency, suboptimal processes, or looking up data in more places instead of one (i.e. what you think is important might not necessarily be important to the business). So talk to them and see what they can tolerate. It might be cheaper to resolve something in an operational way than to invest a lot into trying to build a highly distributable system.
All s/w is Windows based, coded in Delphi.
Some guys submit some data, which I send by TCP to a database server running MySQL.
Some other guys add a pass/fail to their data and update the database.
And a third group are just looking at reports.
Now, the first group can see a history of what they submitted. When the second group adds pass/fail, I would like to update their history. My options seem to be
blindly refresh the history regularly (in Delphi, I display on a DB grid so I would close then open the query), but this seems inefficient.
ask the database server regularly if anything changed in the last X minutes.
never poll the database server, instead letting it inform the user's app when something changes.
Option 1 seems inefficient. Option 2 seems better. Option 3 reduces TCP traffic, but that traffic isn't much anyway - just a few bytes for each poll in option 2. However, option 3 has the disadvantage that both sides are now both TCP client and server.
Similarly, if a member of the third group is viewing a report and a member of either of the first two groups updates data, I wish to reflect this in the report. What is the best way to do this?
I guess there are two things to consider. Most importantly, reduce network traffic and, less importantly, keep my code simple.
I am sure this is a very common pattern, but I am new to this kind of thing, so would welcome advice. Thanks in advance.
[Update] Close voters, I have googled and can't find an answer. I am hoping for the benefit of your experience. Can you help me reword this to be acceptable? Or maybe give a URL which will help me? Thanks
Short answer: use notifications (option 3).
Long answer: this is a use case for some middle layer which propagates changes using a message-oriented middleware. This decouples the messaging logic from database metadata (triggers / stored procedures), can use peer-to-peer and publish/subscribe communication patterns, and more.
I have blogged a two-part article about this at
Firebird Database Events and Message-oriented Middleware (part 1)
Firebird Database Events and Message-oriented Middleware (part 2)
The article is about Firebird but the suggested solutions can be applied to any application / database.
In your scenarios, clients can also use the middleware message broker to send messages to the system even if the database or the Delphi part is down. The messages will be queued in the broker until the other parts of the system are back online. This is an advantage if there are many clients and update installations or maintenance windows are required.
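For illustration only, here is a minimal publish/subscribe sketch in Python. It uses redis-py as a stand-in broker purely because its client API is compact; the same pattern applies to the STOMP/AMQP brokers discussed in the article. The channel name and payload fields are made up.

```python
# Minimal publish/subscribe sketch (option 3), using redis-py as a
# stand-in message broker. Channel name and payload are illustrative.
import json
import redis

broker = redis.Redis(host="localhost", port=6379)

# Writer side: after committing a pass/fail update, announce it.
def announce_update(record_id, status):
    broker.publish("history_updates",
                   json.dumps({"id": record_id, "status": status}))

# Reader side: each client app subscribes and refreshes only what changed.
def listen_for_updates(refresh_row):
    pubsub = broker.pubsub()
    pubsub.subscribe("history_updates")
    for message in pubsub.listen():
        if message["type"] == "message":
            change = json.loads(message["data"])
            refresh_row(change["id"], change["status"])
```

A full message broker adds the durable queuing described above (messages held for offline consumers), which a bare pub/sub channel like this does not provide.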
Similarly, if a member of the third group is viewing a report and a member of either of the first two groups updates data, I wish to reflect this in the report. What is the best way to do this?
If this is a real requirement (reports are usually an immutable 'snapshot' of data, but maybe you mean a view which needs to be updated while being watched, similar to a stock ticker), it is easy to implement - a client just needs to 'subscribe' to an information channel which announces relevant data changes. This can be solved very flexibly and economically with existing message broker features like message selectors and destination wildcards. (Note that I am the author of some Delphi and Free Pascal client libraries for open source message brokers.)
Related questions:
Client-Server database application: how to notify clients that data was changed?
How to communicate within this system?
Each of your proposed solutions is viable in certain situations.
I've been writing software for a long time and comments below relate to personal experience which dates way back to 1981. I have no doubt others will have alternative opinions which will also answer your questions.
Please allow me to justify the positives and negatives of each approach, and the parameters around each comment.
"blindly refresh the history regularly (in Delphi, I display on a DB grid so I would close then open the query), but this seems inefficient."
Yes, this is inefficient.
It is often the quickest and simplest thing to do.
Seems like the best short-term temporary solution which gives maximum value for minimal effort.
Good for "exploratory coding" helping derive a better software design.
Should be a good basis to refine / explore alternatives.
It's very important for programmers to strive to document, and/or share with team members who could be affected by the change, whenever a tech-debt-inducing fix has been checked in.
If not intended as production quality code, this is acceptable.
If usability is poor, then consider more efficient solutions, like what you've described below.
"ask the database server regularly if anything changed in the last X minutes."
You are talking about a "pull" or "polling" model. Consider the following API options for this model:
What's changed since the last time I called you? (The client provides the time, to avoid the service having to store and retrieve session state.)
If nothing has changed, the server can provide a time when the client should poll again. A system under excessive load is then able to back off clients: if the server application is aware of such conditions, it can control the polling rate of compliant clients by instructing them to wait longer before retrying.
After considering that, ask "Is the API as simple as it can possibly be?"
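A rough sketch of such a pull API in Python follows. The table and column names are invented, the connection is assumed to be a DB-API connection with "?" parameters, and the back-off hint is the part worth noting.

```python
# Rough sketch of a "what changed since T?" endpoint (option 2).
# The table and column names here are invented for illustration.
import time

def get_changes_since(conn, last_seen_utc):
    """Return rows changed after last_seen_utc, plus a hint for the next poll."""
    rows = conn.execute(
        "SELECT id, status, updated_at FROM results WHERE updated_at > ?",
        (last_seen_utc,),
    ).fetchall()
    # A server that knows it is under heavy load can hand back a longer
    # interval here to back compliant clients off.
    next_poll_seconds = 60 if rows else 300
    return {"changes": rows,
            "next_poll_seconds": next_poll_seconds,
            "server_time_utc": time.time()}
```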
"never poll the database server, instead letting it inform the user's app when something changes."
This is the "push" model you're talking about- publishing changes, ready for subscribers to act upon.
Consider what impact this has on clients waiting for a push - timeout scenarios, number of clients, etc, System resource consumption, etc.
Consider that the "pusher" has to become aware of all consuming applications. If using industry standard messaging queueing systems (RabbitMQ, MS MQ, MQ Series, etc, all naturally supporting Publish/Subscribe JMS topics or equivalent then this problem is abstracted away, but also added some complexity to your application)
consider the scenarios where clients suddenly become unavailable, hypothesize failure modes and test the robustness of you system so you have confidence that it is able to recover properly from failure and consistently remain stable.
So, what do you think the right approach is now?
I have an app...
The app does a market comparison for a financial product - for a given quote request, it contacts several other sites for their quotes. It then gives the user the results - several quotes for their details.
To manage these requests they get saved to MySQL, and then my app kicks in, picking up the pending quotes and farming them out to threads (all on the same Linux box) to process each site lookup.
I am using JRuby as I had thread/db related issues. I use Java thread pools to control the number of threads. With the current hardware/VPS it can handle around 200 threads. A lot of the limitations seem to relate to each thread grabbing its own MySQL connection - grabbing the quote details and saving back the results. We want to handle more concurrent threads and so are looking for ways to scale up.
Wondering which way to go ...
Bigger hardware...
More machines and use some kind of queueing mechanism (with priorities) to share the load across the machines - so the threads don't touch the db, all the details/responses go via the queue - so the DB hit is less, but then maybe I am just pushing the problem into the queue. Thinking of using something like MongoDB for the queue, but open to suggestions - something easy to use with Ruby :)
Some kind of remote/RPC mechanism, eg dRb - theoretically this seems like a good option, but I haven't done anything with this yet to know how complex it will make things.
Something else...?
From this link Reasons for NOT scaling-up vs. -out? - it would seem this problem is suited to running more machines to solve it.
So, any thoughts on which way to go...
Cheers,
Chris
My usual approach to problems like this is to pay very close attention to the database queries you're making and tune them aggressively. Retrieve only what you need, skipping columns that aren't explicitly used, and be very careful about eager loading things you don't need in their entirety.
You'll often find you can get significant speed gains by adding indexes, or strategically de-normalizing certain attributes in your database to avoid ugly, time-consuming JOIN operations.
Further, think about caching: The fastest database call is the one that's never made. It's not hard to bring in something like Memcached to save the results of a moderately time-consuming record retrieval, and if done carefully it's even easy to invalidate and expire this cache, provided you channel your updates through a few methods.
For scheduling workers, a simple first-in, first-out queue can be implemented in Redis to off-load a lot of the processing overhead from MySQL itself. This is usually very simple to add if you follow an example.
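As an illustration of that queue idea, here is a minimal first-in, first-out work queue sketch. It is written in Python with redis-py to keep the example short; the queue name, payload format, and the processing function are all invented.

```python
# A minimal FIFO work queue sketch using redis-py (assumed available);
# queue name and payload format are made up for illustration.
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def enqueue_quote_request(quote_id):
    # Producers push onto the left of the list...
    r.lpush("quote_requests", json.dumps({"quote_id": quote_id}))

def worker_loop():
    while True:
        # ...workers block-pop from the right, giving first-in, first-out order.
        _, raw = r.brpop("quote_requests")
        job = json.loads(raw)
        process_quote(job["quote_id"])  # hypothetical per-site lookup/processing
```

Workers can run on the same box or on additional machines; only the queue server sees the enqueue/dequeue traffic, not MySQL.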
A cache like Memcached can handle an extremely high amount of traffic, so whenever possible, cache against this to avoid hitting your database for every last thing.
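A sketch of that caching pattern, shown here with the pymemcache client; the key scheme, expiry, and the two database helper functions are illustrative assumptions, and the invalidation call belongs in whatever few methods perform updates.

```python
# Sketch of caching a moderately expensive lookup in Memcached.
# Key scheme, expiry and the *_mysql helpers are illustrative only.
import json
from pymemcache.client.base import Client

cache = Client(("localhost", 11211))

def get_quote(quote_id):
    key = "quote:%d" % quote_id
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    quote = load_quote_from_mysql(quote_id)          # hypothetical slow DB call
    cache.set(key, json.dumps(quote), expire=300)    # cache for 5 minutes
    return quote

def update_quote(quote_id, data):
    save_quote_to_mysql(quote_id, data)              # hypothetical write path
    cache.delete("quote:%d" % quote_id)              # invalidate on update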
If you've exhausted these options, it's time for more front-end servers and even more database capacity, but only then.
Queueing is the easiest thing for you to implement. Use something like this: http://beanstalkd.github.com/beaneater/
Basically you can prepend your methods with async., which will put them into a queue and execute them. The queue and workers can be on the same server or a different one.
I am building an internal tool, which will be open-sourced, to take logs and put them into a database - to put it simply. From there, the tool will also analyze the logs and help alert the sys-admins and developers to issues going on, all in real time. Processing all this takes a lot of CPU, which is beyond the scope of this question.
What I would like to know is which database to choose so that it can quickly perform a number of key tasks:
Store a large number of events categorized by event types
Perform a large number of reads to develop charts to analyze the events that are being logged
Read in real-time to send and trigger automated alerts to the system.
And any other help would be greatly appreciated, too. Code On.
In my observation, MongoDB performs an order of magnitude better than an RDBMS for the task you describe - massive storage of logs. Particularly good performers are capped collections. The major performance lag I've seen with an RDBMS was insert times. A huge disadvantage of an RDBMS is the schema, which is a major pain to upgrade if needed. Because of these reasons we have started to move towards MongoDB - check out logFaces. If you are building your own tool for the open source community, try to make sure it will work with ANY database, not just a particular brand. But then it becomes a not so trivial task :)
(for disclosure - I am the original author of logFaces, so the opinion could be biased)
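As a rough illustration of the capped-collection idea mentioned above, here is a small pymongo sketch; the database name, collection name, size, and document fields are arbitrary choices.

```python
# Sketch: a capped collection acts as a fixed-size, insert-ordered log store.
# Database name, collection name and size here are arbitrary choices.
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["logdb"]

# Create the capped collection once (skip if it already exists).
if "events" not in db.list_collection_names():
    db.create_collection("events", capped=True, size=512 * 1024 * 1024)

db.events.insert_one({"type": "error", "source": "web-1",
                      "message": "timeout talking to payment gateway"})

# Recent events of one type, newest first - $natural order needs no index.
recent_errors = db.events.find({"type": "error"}).sort("$natural", -1).limit(100)
```

Because the collection is capped, old entries are discarded automatically once the size limit is reached, which fits a rolling log store.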
Storing just events sounds like a simple model, so you might want to take a look at NoSQL databases. I think key-value stores/bigtables will handle really large amounts of data better than document-based databases in this case.
A large number of reads and analysis, on the other hand, sounds like you might want to build a data warehouse system. This is the good old SQL approach, with some denormalization for optimised reading. Though it can take some time to design and implement.
I've been doing some simple web programming using python, and I have a basic understanding of most of the parts involved in generating and serving web pages. However, I have only a tenuous grasp on the use of Relational Databases as a way to store and retrieve data. I do understand the basics of SQL queries and database design, but am having trouble understanding what I should be doing to allow for concurrent access (among other things).
With that in mind I have a couple fairly specific questions. However, for each question, I'm only partially interested in the answer to the question itself. I'm mostly interested in whether or not I'm asking these questions in the right way. So here it goes:
When using a relational database, how do you ensure that multiple threads don't interfere with each other while writing to the database?
Could having multiple threads accessing a database create a situation in which the data they are reading are out of sync?
How should I manage permissions to read/write from a database?
Are there things that don't belong in a database (images, large chunks of text)?
I'd love any commentary on these specific questions, or a pointer to any resource that describes the correct way of thinking about using a relational database on the web.
A lot of your concerns are abstracted away by a DBMS. You don't generally need to stress about the thread/concurrency-related stuff. What you can do is group inserts/updates/queries into transactions to make them atomic and ensure that all or nothing happens. Such transactions can be rolled back if, for example, they are interfered with partway through.
You don't mention what DB you use, but here is a small DB-agnostic intro to transactions. Of course you should also check out the official documentation for your database.
http://www.sqlteam.com/article/introduction-to-transactions
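As a small, DB-agnostic illustration (shown here with Python's built-in sqlite3 module; the table and amounts are invented), grouping two statements into one transaction means either both happen or neither does:

```python
# Sketch: group related writes into one transaction so they succeed or
# fail together. Uses the stdlib sqlite3 module; the table is invented.
import sqlite3

conn = sqlite3.connect("app.db")

try:
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?",
                     (100, 1))
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                     (100, 2))
except sqlite3.Error:
    # The partial update was rolled back; handle or log the failure here.
    pass
```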
As far as 'what things don't belong in a database', images and large chunks of text are fine. You can store binary blobs, and you can store code if it makes sense for what you're doing. One thing I'd suggest is that you consider whether it is in your interest to store images directly in the DB or to store paths/filenames for files sitting on your server instead.
what I should be doing to allow for concurrent access
You let the database handle that, it's what it is designed for.
When using a relational database, how do you ensure that multiple threads don't interfere with each other while writing to the database?
The database will handle this. Sometimes this will mean that one of the queries will abort in order to avoid a deadlock. You need to detect this in your code.
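A sketch of what "detect this in your code" might look like: retry the transaction a few times when the database reports a deadlock. The exact exception type and message depend on your driver, so the check below is a deliberately generic stand-in.

```python
# Sketch: retry a transaction a few times if the database aborts it to
# resolve a deadlock. The exact exception depends on your DB driver.
import time

def run_with_retry(conn, do_work, attempts=3):
    for attempt in range(attempts):
        try:
            with conn:          # commit on success, rollback on exception
                return do_work(conn)
        except Exception as exc:     # e.g. OperationalError: deadlock detected
            if "deadlock" not in str(exc).lower() or attempt == attempts - 1:
                raise
            time.sleep(0.1 * (attempt + 1))   # brief backoff, then retry
```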
Could having multiple threads accessing a database create a situation in which the data they are reading are out of sync?
Yes, this is possible. Not much you can do about it - it is a consequence of multiple threads reading/writing the same data. There are synchronization commands that you can use, but these can have an effect on performance.
How should I manage permissions to read/write from a database?
Through the database's security mechanisms, whatever they are.
Are there things that don't belong in a database (images, large chunks of text)?
Large files, though even that depends on the application. Store application data in your database.
I would not expose a database directly to the web; I'd have a middle tier between clients and database to handle things like authentication and authorization, validation and binding, synchronization and isolation for database access, etc.
This would have the added benefit of letting me scale by adding more middle tier hardware.