What is the best way to structure a database for a stock exchange? - mysql

I am trying to make a stock market simulator and I want it to be as realistic as possible.
My question is: Nasdaq has 3000+ companies in its database of stocks, right? But is there one row for every share of every symbol in the SQL DB, like the following example?
Company Microsoft = MSFT
db `companies_shares`
ID | symbol | price | owner_id* | company_id | last_trade_datetime
 1 | msft   | 58.99 | 54334     | 101        | 2019-06-15 13:09:32
 2 | msft   | 58.99 | 54334     | 101        | 2019-06-15 13:09:32
 3 | msft   | 58.91 | 2231      | 101        | 2019-06-15 13:32:32
 4 | msft   | 58.91 | 544       | 101        | 2019-06-15 13:32:32
*owner_id = user id of the person that last bought the share.
Or is the price calculated based on the shares available to trade and the demand to buy and sell provided by the market maker?
I've already tried the first example, but it takes a lot of space in my db and I'm concerned about the bandwidth of all those trades, especially when millions of requests (trades) are being made every minute.
What is the best solution? Database or math?
Thanks in advance.

You might also want to Google many-to-many relationships.
Think about it this way: one person might own many stocks, and one stock might be held by many people. That is a many-to-many relationship, usually modelled using three tables in a physical database. This is often written as M:M.
Also, people might buy or sell a single company's stock on multiple occasions; this would likely be modelled using another table. From the person's perspective there will be many transactions, so we have a new type of relationship: one (person) to many (transactions). This is often written as a 1:M relationship.
As to what to store, as a general rule it is best to store the atomic pieces of data. For example, for a transaction, store the customer id, the transaction date/time, the quantity bought or sold and the price, at the very least.
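As a rough sketch of those M:M and 1:M shapes in MySQL (all table and column names here are illustrative assumptions, not NASDAQ's actual schema):

```sql
-- People and stocks linked through trades, instead of one row per share.
CREATE TABLE persons (
  person_id INT AUTO_INCREMENT PRIMARY KEY,
  name      VARCHAR(100) NOT NULL
);

CREATE TABLE stocks (
  stock_id INT AUTO_INCREMENT PRIMARY KEY,
  symbol   VARCHAR(10) NOT NULL UNIQUE    -- e.g. 'MSFT'
);

-- One row per trade, not per share: quantity records how many changed hands.
CREATE TABLE transactions (
  transaction_id BIGINT AUTO_INCREMENT PRIMARY KEY,
  person_id INT NOT NULL,
  stock_id  INT NOT NULL,
  traded_at DATETIME NOT NULL,
  quantity  INT NOT NULL,                 -- positive = buy, negative = sell
  price     DECIMAL(10,2) NOT NULL,       -- price per share at trade time
  FOREIGN KEY (person_id) REFERENCES persons(person_id),
  FOREIGN KEY (stock_id)  REFERENCES stocks(stock_id)
);

-- Holdings are then derived with math, not stored row-per-share:
SELECT stock_id, SUM(quantity) AS shares_held
FROM transactions
WHERE person_id = 54334
GROUP BY stock_id;
```

That is effectively the "database or math?" answer: store the atomic trades, compute the holdings.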
You might also want to read up about normalization. Usually 3rd normal form is a good level to strive for, but a lot of this is "it depends on your circumstances and what you need to do". Often people will denormalize for speed of access, at the expense of more storage and potentially more complicated updating.
You also mentioned performance. More often than not, big companies such as NASDAQ will use multiple layers of IT infrastructure. Each layer will have a different role and thus different functional and performance characteristics. Often there will be multiple servers operating together in a cluster. For example, they might use a NoSQL system to manage the high volume of trading. From there, there might be a feed (e.g. Kafka) into other systems for other purposes (e.g. fraud prevention, analytics, reporting and so on).
You also mention data volumes. I do not know how much data you are talking about, but one financial customer I've worked with had several petabytes of storage (1 petabyte = 1,000 TB) running on over 300 servers, just for their analytics platform. They were probably on the medium to large side as far as financial institutions go.
I hope this helps point you in the right direction.

Related

Do I have too many columns in my MySQL table?

I'm a junior doctor and I'm creating a database system for my senior doctor.
Basically, my senior Dr wants to be able to store a whole lot of information on each of his patients in a relational database so that later, he can very easily and quickly analyse/audit the data (e.g. based on certain demographics, find which treatments result in better outcomes, or which ethnicities respond better to certain treatments, etc.).
The information he wants to store for each patient is huge.
Each patient is to complete 7 surveys (each only takes 1-2 minutes) a number of times (immediately before their operation, immediately postop, 3 months postop, 6 months postop, 2 years postop and 5 years postop) - the final scores of each of these surveys at these various times will be stored in the database.
Additionally, he wants to store their relevant details (name, ethnicity, gender, age etc etc).
Finally, he also intends to store A LOT of relevant past medical history, current symptoms, examination findings, the various treatment options they try and then outcome measures.
Basically, there's A LOT of info for each patient, and all of this info is unique to each patient. Because of this, I've created one HUGE patient table (~400 columns) to contain all of it. The reason I've done this is that most of the columns in the table will NOT be redundant for each patient.
Additionally, this entire php / mysql database system is only going to live locally on his computer, it'll never be on the internet.
The tables won't have too many patients, maybe around 200 - 300 by the end of the year.
Given all of this, is it ok to have such a huge table?
Or should I be splitting it into smaller tables
i.e.
- Patient demographics
- Survey results
- Symptoms
- Treatments
etc. etc, with a unique "patient_id" being the link between each of these tables?
What would be the difference between the 2 approaches and which would be better? Why?
About the 400 columns...
Which, if any, of the columns will be important to search or sort on? Probably very few of them. Keep those few columns as columns.
What will you do with the rest? Probably you simply display them somewhere, using some app code to pretty-print them? So these may as well be in a big JSON string.
This avoids the EAV nightmares, yet stores the data in the database in a format that is actually reasonably easy (and fast) to use.
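A minimal sketch of that hybrid layout (JSON columns need MySQL 5.7+; the column and survey names here are made up for illustration):

```sql
CREATE TABLE patients (
  patient_id INT AUTO_INCREMENT PRIMARY KEY,
  gender     CHAR(1),
  ethnicity  VARCHAR(40),
  birth_date DATE,    -- the few searchable/sortable fields stay as real columns
  details    JSON     -- the hundreds of display-only fields live in one blob
);

-- Filter on the real columns; let the app pretty-print the JSON:
SELECT patient_id, details
FROM patients
WHERE gender = 'F' AND ethnicity = 'Maori';

-- Individual values can still be pulled out when an audit needs them:
SELECT patient_id,
       details->>'$.surveys.preop.score' AS preop_score
FROM patients;
```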

Optimal way of storing performance data for statistics (graphs)

Currently I'm working on a dashboard in PHP/MySQL which contains several statistics/facts such as: number of items sold, revenue, gender (male/female) ratio of users, etc. (all filterable on last week/month/year). The amount of data is (currently) not that much: 20,000 user rows, 1,000 items, 500 items sold per day, but it is expected to grow in the future, perhaps even exponentially.
Now, there is a wish to have several graphs displaying the performance to see whether strategy changes have impacts on the amount of users, revenue, gender ratio etc. For this, it is necessary to have numbers per day. Currently, the dashboard can only display "NOW() - 1 week/1 month/1 year" but for showing a graph outlining the growth, these numbers should be saved on a daily basis.
My question is: what are the options in this case? A cron job could be set up to save these numbers and write them to a separate 'performance' or 'history' table that stores the visitors, sales, gender ratio etc. in rows linked to the date of that day. This is good for performance, but certain data gets lost. Another option is to compute these numbers with complex queries (GROUP BY day etc.), but that seems too intensive, since the queries would be performed on the production database, especially since the database structure is a little complex. To avoid overloading the production database, is setting up a data warehouse with ETL processes a better option? In that case the data would not be displayed live.
I honestly have no idea what is the best option in this case. I'm very curious about the answers! Many thanks.
Running queries on a production database (especially one which is growing in volume and complexity) becomes a losing proposition very quickly. There are a lot of possible alternatives; basically the entire field of Business Intelligence grew up as a solution to this problem.
For a small system where you just want to avoid querying the production database, developing a full-blown data warehouse is probably overkill. It is impossible to give a definitive answer without knowing more, but I would go for one of the following (in increasing order of complexity/quality of result):
1. Instead of directly showing the result of the query, save it in a table and query that table (see the sketch after this list).
2. Clone your production database, then query the clone.
3. Extract relevant data from the production database into a structure which saves the relevant data and preserves history (google "Data Vault").
4. Directly over the production DB, or over solution 2 or 3, build a dimensional model (google "Kimball dimensional model"). Pay attention: to do a good job you have to consider what kinds of queries you want to run, and you could end up with different designs for different requirements.
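For option 1, a minimal sketch of the cron-fed history table (the table and column names are assumptions about your schema):

```sql
CREATE TABLE daily_stats (
  stat_date    DATE PRIMARY KEY,
  items_sold   INT UNSIGNED NOT NULL,
  revenue      DECIMAL(12,2) NOT NULL,
  male_users   INT UNSIGNED NOT NULL,
  female_users INT UNSIGNED NOT NULL
);

-- Run nightly from cron; snapshots yesterday's sales plus the current
-- user counts, so the graphs can read this table instead of production:
INSERT INTO daily_stats (stat_date, items_sold, revenue, male_users, female_users)
SELECT CURDATE() - INTERVAL 1 DAY,
       COUNT(*),
       COALESCE(SUM(price), 0),
       (SELECT COUNT(*) FROM users WHERE gender = 'M'),
       (SELECT COUNT(*) FROM users WHERE gender = 'F')
FROM sales
WHERE sale_date = CURDATE() - INTERVAL 1 DAY;
```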
It is also relevant which technology you are using and what options are available on your architecture. Depending on what you have on hand, some solutions, even complex ones, may be very much simplified. Do some research.

Using Same Customer ID naming in DB with different Clients

I am looking for a better customer ID numbering strategy than the competition's: they use a sequential strategy where, no matter what the client ID is, the numbering is continuous. So they may have a UCID that is not tied to location, only unique to the software company. In theory, if I had 1,000 locations using our software and each client has 20,000 customers, the customer ID numbers would obviously run into the millions.
My thought is to take an approach similar to the SSN, as we will have clients using our SaaS in all 50 states. I want to take a prefix approach, using as the prefix the order in which the client's state was admitted to the US. So if it was Delaware the prefix would be 01-XXXX, and if this was our 37th client in Delaware the ID would be 01-0037, and the very first customer entered into this client-level DB table would automatically start at 1.
What would be the pros of this idea, and what cons could come out of it? I am also hoping that this would allow for easier enterprise reporting.
If you have a single database server doling out ids, then you can invent whatever scheme you want.
If you "distribute" the number-generating across many servers, then you must come up with some universally unique mechanism, such as UUIDs (which are not consecutive) or your 2-part mechanism (which is consecutive in the second part).
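If you do go with the two-part scheme on a single database server, one common MySQL idiom is a counter table per prefix; a sketch, with all names assumed:

```sql
CREATE TABLE state_counters (
  state_prefix CHAR(2) PRIMARY KEY,       -- '01' = Delaware in your scheme
  next_seq     INT UNSIGNED NOT NULL DEFAULT 1
);

-- Atomically claim the next number; LAST_INSERT_ID(expr) makes the claimed
-- value readable in this connection without racing other sessions:
UPDATE state_counters
SET next_seq = LAST_INSERT_ID(next_seq) + 1
WHERE state_prefix = '01';

SELECT CONCAT('01-', LPAD(LAST_INSERT_ID(), 4, '0')) AS customer_id;  -- e.g. 01-0037
```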
If you have one sequence per state, keep in mind that California will probably have 100 times as many numbers as the smallest states. Does this matter?
You should really think about why you need a "sequential" numbering strategy at all.
What will you do when there are holes in the numbers? Such will happen if there are server crashes, etc. Or when a 'client' vanishes (dies, goes out of business, moves out of state, etc).
Consecutive numbers are useful for Invoices -- so an audit can 'prove' that there are no missing or extra Invoices. But why for client ids?

MySQL Theory: Single DB for multiple companies

I have a design question for MySQL. As a side project, I am attempting to create a cloud-based safety management system. In the most basic terms, a company will subscribe to the service, which will manage company document records as BLOBs, corrective actions, employee information and audit results.
My initial design concept was to have a separate DB for each company.
However, the question I have is: if user access control is secure, would it be OK to have all the companies under one DB? What are the pitfalls of this? Are there any performance issues to consider? For identifying records, would it be a compound key of the company and a reference ID number unique within each company? If so, when generating a reference number for a record of a company, would it slow down as the record set increases?
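For illustration, the compound-key idea I have in mind would be something like this (names made up):

```sql
CREATE TABLE documents (
  company_id   INT UNSIGNED NOT NULL,
  reference_id INT UNSIGNED NOT NULL,   -- unique per company, not globally
  title        VARCHAR(200) NOT NULL,
  body         LONGBLOB,
  PRIMARY KEY (company_id, reference_id)
);

-- Next reference number for a company; the (company_id, reference_id)
-- primary key keeps this MAX() lookup fast as the record set grows,
-- but it should run inside a transaction (or use a counter table) to
-- avoid two sessions claiming the same number:
SELECT COALESCE(MAX(reference_id), 0) + 1
FROM documents
WHERE company_id = 42;
```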
In terms of constraints, I would expect up to 2,000 companies and initially a maximum of 1,000 records per company, growing at 5% per year. I expect a maximum of 2 GB of blob storage per company, growing at 10% per year. The system is to run on one cloud server, whether as multiple DBs or one big one.
Any thoughts on this would be appreciated.
If there is not much inter-company interaction or need for frequent overall statistics, and you don't plan to make application updates every week or so that would impact the DB structure, I'd go with a separate DB (and DB user) for each company. It's more scalable, less prone to user-access bugs and makes some operations, such as removing a company, easier.
On the other hand, 2 million entries is not such a big deal, and if you plan to develop the application further, keeping it all in one DB could be the better approach.
You have two questions: performance and security.
If you use the same MySQL user, security will not be different from one option to the other.
If you need performance, you can get the same results running one or multiple databases (see for instance MySQL partitioning).
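As a sketch of what that could look like in the one-big-DB option (table and column names assumed):

```sql
CREATE TABLE records (
  company_id INT UNSIGNED NOT NULL,
  record_id  INT UNSIGNED NOT NULL,
  data       TEXT,
  PRIMARY KEY (company_id, record_id)  -- the partitioning column must appear in every unique key
)
PARTITION BY HASH (company_id) PARTITIONS 16;  -- spreads companies over 16 partitions
```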
But there are other things you should consider: such as how easy it will be to have one database for your website... or how easy it would be to have one database per user.
In fact, I'll give you an answer: considering the size of your data, don't base the choice on performance matters, which are roughly equal for your needs either way, but on whichever option will make your life easier.

First-time database design: am I overengineering? [closed]

Background
I'm a first year CS student and I work part time for my dad's small business. I don't have any experience in real world application development. I have written scripts in Python, some coursework in C, but nothing like this.
My dad has a small training business and currently all classes are scheduled, recorded and followed up via an external web application. There is an export/"reports" feature but it is very generic and we need specific reports. We don't have access to the actual database to run the queries. I've been asked to set up a custom reporting system.
My idea is to create the generic CSV exports and import them (probably with Python) into a MySQL database hosted in the office every night, from where I can run the specific queries that are needed. I don't have experience with databases but understand the very basics. I've read a little about database creation and normal forms.
We may start having international clients soon, so I want the database to not explode if/when that happens. We also currently have a couple big corporations as clients, with different divisions (e.g. ACME parent company, ACME healthcare division, ACME bodycare division)
The schema I have come up with is the following:
From the client perspective:
- Clients is the main table
- Clients are linked to the department they work for
- Departments can be scattered around a country: HR in London, Marketing in Swansea, etc.
- Departments are linked to the division of a company
- Divisions are linked to the parent company
From the classes perspective:
- Sessions is the main table
- A teacher is linked to each session
- A status id is given to each session, e.g. 0 - Completed, 1 - Cancelled
- Sessions are grouped into "packs" of an arbitrary size
- Each pack is assigned to a client
I "designed" (more like scribbled) the schema on a piece of paper, trying to keep it normalised to the 3rd normal form. I then plugged it into MySQL Workbench, which made it all pretty for me.
Example queries I'll be running
- Which clients with credit still left are inactive (those without a class scheduled in the future)?
- What is the attendance rate per client/department/division (measured by the status id in each session)?
- How many classes has a teacher had in a month?
- Flag clients who have a low attendance rate.
- Custom reports for HR departments with attendance rates of people in their division.
Question(s)
1. Is this overengineered or am I headed the right way?
2. Will the need to join multiple tables for most queries result in a big performance hit?
3. I have added a 'lastsession' column to clients, as it is probably going to be a common query. Is this a good idea or should I keep the database strictly normalised?
Thanks for your time
Some more answers to your questions:
1) You're pretty much on target for someone who is approaching a problem like this for the first time. I think the pointers from others on this question thus far pretty much cover it. Good job!
2 & 3) The performance hit you will take will largely be dependent on having and optimizing the right indexes for your particular queries / procedures and more importantly the volume of records. Unless you are talking about well over a million records in your main tables you seem to be on track to having a sufficiently mainstream design that performance will not be an issue on reasonable hardware.
That said, and this relates to your question 3, with the start you have you probably shouldn't really be overly worried about performance or hyper-sensitivity to normalization orthodoxy here. This is a reporting server you are building, not a transaction based application backend, which would have a much different profile with respect to the importance of performance or normalization. A database backing a live signup and scheduling application has to be mindful of queries that take seconds to return data. Not only does a report server function have more tolerance for complex and lengthy queries, but the strategies to improve performance are much different.
For example, in a transaction based application environment your performance improvement options might include refactoring your stored procedures and table structures to the nth degree, or developing a caching strategy for small amounts of commonly requested data. In a reporting environment you can certainly do this but you can have an even greater impact on performance by introducing a snapshot mechanism where a scheduled process runs and stores pre-configured reports and your users access the snapshot data with no stress on your db tier on a per request basis.
All of this is a long-winded rant to illustrate that what design principles and tricks you employ may differ given the role of the db you're creating. I hope that's helpful.
You've got the right idea. You can however clean it up, and remove some of the mapping (has*) tables.
What you can do is in the Departments table, add CityId and DivisionId.
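In SQL, something like this (column and table names guessed from the diagram):

```sql
ALTER TABLE Departments
  ADD COLUMN CityId INT NOT NULL,
  ADD COLUMN DivisionId INT NOT NULL,
  ADD FOREIGN KEY (CityId) REFERENCES Cities(CityId),
  ADD FOREIGN KEY (DivisionId) REFERENCES Divisions(DivisionId);
```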
Besides that, I think everything is fine...
The only changes I would make are:
1- Change your VARCHAR to NVARCHAR; if you might be going international, you may want Unicode.
2- Change your int ids to GUIDs (uniqueidentifier) if possible (this might just be my personal preference). Assuming you eventually get to the point where you have multiple environments (dev/test/staging/prod), you may want to migrate data from one to the other. Having GUID ids makes this significantly easier (see the first sketch after this list).
3- Three layers for your Company -> Division -> Department structure may not be enough. Now, this might be over-engineering, but you could generalise that hierarchy such that you can support n levels of depth (see the second sketch after this list). This will make some of your queries more complex, so it may not be worth the trade-off. Further, any client that has more layers may then be easily "stuffable" into this model.
4- You also have a Status in the Client Table that is a VARCHAR and has no link to the Statuses table. I'd expect a little more clarity there as to what the Client Status represents.
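On point 2, MySQL has no uniqueidentifier type, but a sketch of the equivalent (UUID_TO_BIN and expression defaults need MySQL 8.0+):

```sql
CREATE TABLE Clients (
  ClientId BINARY(16) PRIMARY KEY DEFAULT (UUID_TO_BIN(UUID())),
  Name     VARCHAR(100) NOT NULL
  -- ...remaining columns as in the diagram
);
```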
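On point 3, the usual way to generalise to n levels is an adjacency list, walked with a recursive CTE (MySQL 8.0+); all names here are illustrative:

```sql
CREATE TABLE OrgUnits (
  OrgUnitId INT AUTO_INCREMENT PRIMARY KEY,
  ParentId  INT NULL,                   -- NULL marks a top-level company
  Name      VARCHAR(100) NOT NULL,
  FOREIGN KEY (ParentId) REFERENCES OrgUnits(OrgUnitId)
);

-- Expand the whole tree, top down:
WITH RECURSIVE tree AS (
  SELECT OrgUnitId, ParentId, Name, 0 AS depth
  FROM OrgUnits
  WHERE ParentId IS NULL
  UNION ALL
  SELECT o.OrgUnitId, o.ParentId, o.Name, t.depth + 1
  FROM OrgUnits o
  JOIN tree t ON o.ParentId = t.OrgUnitId
)
SELECT * FROM tree ORDER BY depth;
```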
No. It looks like you're designing at a good level of detail.
I think that Countries and Companies are really the same entity in your design, as are Cities and Divisions. I'd get rid of the Countries and Cities tables (and Cities_Has_Departments) and, if necessary, add a boolean flag IsPublicSector to the Companies table (or a CompanyType column if there are more choices than simply Private Sector / Public Sector).
Also, I think there's an error in your usage of the Departments table. It looks like the Departments table serves as a reference to the various kinds of departments that each customer division can have. If so, it should be called DepartmentTypes. But your clients (who are, I assume, attendees) do not belong to a department TYPE, they belong to an actual department instance in a company. As it stands now, you will know that a given client belongs to an HR department somewhere, but not which one!
In other words, Clients should be linked to the table that you call Divisions_Has_Departments (but that I would call simply Departments). If this is so, then you must collapse Cities into Divisions as discussed above if you want to use standard referential integrity in the database.
By the way, it's worth noting that if you're generating CSVs already and want to load them into a MySQL database, LOAD DATA LOCAL INFILE is your best friend: http://dev.mysql.com/doc/refman/5.1/en/load-data.html . mysqlimport is also worth looking into; it is a command-line tool that's basically a nice wrapper around LOAD DATA INFILE.
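A typical invocation for a nightly CSV import (the path and table name are placeholders):

```sql
LOAD DATA LOCAL INFILE '/path/to/export.csv'
INTO TABLE Clients
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;  -- skip the CSV header row
```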
Most things have already been said, but I feel that I can add one thing: it is quite common for younger developers to worry about performance a little too much up-front, and your question about joining tables seems to go in that direction. This is a software development anti-pattern called 'Premature Optimization'. Try to banish that reflex from your mind :)
One more thing: Do you believe you really need the 'cities' and 'countries' tables? Wouldn't having a 'city' and 'country' column in the departments table suffice for your use cases? E.g. does your application need to list departments by city and cities by country?
The following comments are based on my role as a Business Intelligence/Reporting specialist and strategy/planning manager:
I agree with Larry's direction above. IMHO, it's not so much over-engineered; some things just look a little out of place. To keep it simple, I would tag clients directly to a Company ID, Department Description, Division Description, Department Type ID and Division Type ID. Use Department Type ID and Division Type ID as references to lookup tables and internal reporting/analysis fields for long-term consistency.
The Packs table contains a "Credit" column; shouldn't that actually be tied to the Client base table, so that if they have many packs you can see how much credit is left for future classes? The application can take care of the calculation and store it centrally in the Client table.
Company info could use many more fields, including the obvious address/phone/etc. information. I'd also be prepared to add D&B "DUNS" columns (Site/Branch/Ultimate) long term; Dun & Bradstreet (D&B) has a huge catalog of companies, and you'll find later down the road that their information is very helpful for reporting/analysis. This will take care of the multiple-division issue you mention, and allow you to roll up the hierarchy for subsidiaries/divisions/branches/etc. of large corporations.
You don't mention how many records you'll be working with, which could imply setting yourself up for a large development initiative that could have been done quicker, and with far fewer headaches, using prepackaged "reporting" software. If you're not dealing with a large database (< 65,000 rows), make sure MS Access, OpenOffice (Base) or related report/app-dev solutions couldn't do the trick. I use Oracle's free APEX software quite a bit myself; it comes with their free database Oracle XE, just download it from their site.
FYI - Reporting insight: for large databases, you typically have two database instances: a) a transaction database for recording each detailed record, and b) a reporting database (data mart/data warehouse) housed on a separate machine. For more information, google both Star Schema and Snowflake Schema.
Regards.
I want to address only the concern that joining multiple tables will cause a performance hit. Do not be afraid to normalize because you will have to do joins. Joins are normal and expected in relational databases, and they are designed to handle them well. You will need to set up PK/FK relationships (for data integrity; this is important to consider in designing), but in many databases FKs are not automatically indexed. Since they will be used in the joins, you will definitely want to start by indexing the FKs. PKs generally get an index on creation, as they have to be unique.
It is true that data warehouse design reduces the number of joins, but usually one doesn't get to the point of data warehousing until one has millions of records that need to be accessed in one report. Even then, almost all data warehouses start with a transactional database to collect the data in real time; the data is then moved to the warehouse on a schedule (nightly, monthly, or whatever the business need is). So this is a good start, even if you need to design a data warehouse later to improve report performance.
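For example (index and column names guessed from the schema described in the question; note that InnoDB creates an index automatically for a declared FK constraint, but join columns without a constraint need one by hand):

```sql
-- Make the common join/filter columns fast:
ALTER TABLE Sessions ADD INDEX idx_sessions_teacher (TeacherId);
ALTER TABLE Sessions ADD INDEX idx_sessions_pack    (PackId);
ALTER TABLE Packs    ADD INDEX idx_packs_client     (ClientId);
```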
I must say your design is impressive for a first year CS student.
It isn't over-engineered; this is how I would approach the problem. Joining is fine; there won't be much of a performance hit (it's completely necessary unless you de-normalise the database, which isn't recommended!). For statuses, see if you can use an ENUM datatype instead to optimise that table away.
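The ENUM suggestion in sketch form (values guessed from the question's "0 - Completed, 1 - Cancelled" example; 'Scheduled' is an added assumption):

```sql
CREATE TABLE Sessions (
  SessionId INT AUTO_INCREMENT PRIMARY KEY,
  TeacherId INT NOT NULL,
  Status    ENUM('Scheduled', 'Completed', 'Cancelled') NOT NULL DEFAULT 'Scheduled'
  -- ...no separate Statuses lookup table needed
);
```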
I've worked in the training / school domain and I thought I'd point out that there's generally a M:1 relationship between what you call "sessions" (instances of a given course) and the course itself. In other words, your catalog offers the course ("Spanish 101" or whatever), but you might have two different instances of it during a single semester (Tu-Th taught by Smith, Wed-Fri taught by Jones).
Other than that, it looks like a good start. I bet you'll find that the client domain (graphs leading to "clients") is more complex than you've modeled, but don't go overboard with that until you've got some real data to guide you.
A few things came to mind:
The tables seemed geared to reporting, but not really running the business. I would think when a client signs up, there's essentially an order being placed for the client attending a list of sessions, and that order might be for multiple employees in one company. It would seem an "order" table would really be at the center of your system and driving your data capture and eventual reporting. (Compare the paper documents you've been using to run the business with your database design to see if there's a logical match.)
Companies often don't have divisions. Employees sometimes change divisions/departments, maybe even mid-session. Companies sometimes add/delete/rename divisions/departments. Make sure the possibly changing real-time contents of your tables don't make subsequent reporting/grouping difficult. With so much contact data split over so many tables, you might have to enforce very strict data-entry validation to keep your reports meaningful and inclusive. E.g., when a new client is added, make sure his company/division/department/city match the same values as his coworkers'.
The "packs" concept isn't clear at all.
Since you indicate it's a small business, it would be surprising if performance would be an issue, considering the speed and capacity of current machines.