How to design this "bus stations" database? - mysql

I want to design a database about bus stations. There are about 60 buses in the city, and each of them has the following information:
BusID
BusName
Lists of stations in the way (forward and back)
This database must be efficient to search. For example, when a user wants to list the buses that go through stations A and B, the query must run quickly.
My first thought is to put stations in a separate table with StationId and Station columns, and have each bus's list of stations contain those StationIds. I guess it may work, but I'm not sure that it's efficient.
How can I design this database?
Thank you very much.

Have you looked at Database Answers to see if there is a schema that fits your requirements?

I had to solve this problem and I used this :
Line
    number
    name
Station
    name
    latitude
    longitude
    is_terminal
Holiday
    date
    description
Route
    line_id
    from_terminal : station_id
    to_terminal : station_id
Route schedule
    route_id
    is_holiday_schedule
    starting_at
Route stop
    route_id
    station_id
    elapsed_time_from_start : in minutes
Does it look good to you?

Some random thoughts based on travel on London buses In My Youth, because this could be quite complex I think.
You might need entities for the following:
1. Bus -- the physical entity, with a particular model (i.e. particular seating capacity, disabled access, dimensions etc.) and VIN.
2. Bus stop -- the location at which a bus stops. Usually bus stops come in pairs, one for each side of the road, but sometimes they are on a one-way road.
3. Route -- a sequence of bus stops and the route between them (multiple possible roads exist). Sometimes buses do not run the entire route, or skip stops (fast service). Is a route just one direction, or is it both? Maybe a route is actually a loop, not a there-and-back.
4. Service -- a bus following a certain route.
5. Scheduled Run -- an event when a bus on a particular service follows a particular route. It starts at some part of the route, ends at another part, and maybe skips certain stops (see 3).
6. Actual Run -- a particular bus followed a particular scheduled run. What time did it start, what time did it get to particular stops, how many people got on and off, what kind of ticket did they have?

(This sounds like homework, so I won't give a full answer.)
It seems like you just need a many-to-many relationship between buses and stops using 3 tables. A query with two inner joins will give you the buses that stop at two specific stops.
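To make the three-table idea concrete, here is a minimal sketch using Python's built-in sqlite3 module so it can be run anywhere; the SQL itself should carry over to MySQL with minor changes. The table and column names (bus, station, bus_station, stop_order) and the sample rows are my own assumptions, not something from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bus (bus_id INTEGER PRIMARY KEY, bus_name TEXT NOT NULL);
CREATE TABLE station (station_id INTEGER PRIMARY KEY, station_name TEXT NOT NULL);
-- Junction table: one row per stop a bus makes, in order (hypothetical layout).
CREATE TABLE bus_station (
    bus_id     INTEGER NOT NULL REFERENCES bus(bus_id),
    station_id INTEGER NOT NULL REFERENCES station(station_id),
    stop_order INTEGER NOT NULL,
    PRIMARY KEY (bus_id, stop_order)
);
-- Index that lets "which buses stop at station X" avoid a full scan.
CREATE INDEX idx_bus_station_station ON bus_station (station_id, bus_id);

INSERT INTO bus VALUES (1, 'Bus 1'), (2, 'Bus 2');
INSERT INTO station VALUES (1, 'A'), (2, 'B'), (3, 'C');
INSERT INTO bus_station VALUES (1, 1, 1), (1, 2, 2), (2, 1, 1), (2, 3, 2);
""")


def buses_through(conn, station_a, station_b):
    """Return names of buses that stop at both stations (two inner joins)."""
    rows = conn.execute(
        """
        SELECT DISTINCT b.bus_name
        FROM bus AS b
        JOIN bus_station AS s1 ON s1.bus_id = b.bus_id
        JOIN bus_station AS s2 ON s2.bus_id = b.bus_id
        WHERE s1.station_id = ? AND s2.station_id = ?
        ORDER BY b.bus_name
        """,
        (station_a, station_b),
    ).fetchall()
    return [row[0] for row in rows]


print(buses_through(conn, 1, 2))  # buses serving both A and B
```

Adding `AND s1.stop_order < s2.stop_order` to the WHERE clause would restrict the result to buses that reach A before B, if direction matters.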

I'd hack it.
bus_id int
path varchar(max)
If a bus goes through the following stations (in this order):
01
03
09
17
28
Then I'd put in a record where path was set to
'-01-03-09-17-28-'
When someone wants to find a bus to get from station 03 to 28, then my select statement is
select * from buses where path like '%-03-%-28-%'
Not scalable, not elegant, but dead simple and won't churn through tables like mad when trying to find a route. Of course, it only works if there's a single bus that goes through the two stations in question.
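A quick way to try the path trick is the sketch below, which uses Python's sqlite3 module as a stand-in for MySQL (the LIKE wildcards behave the same way for this purpose). The extra sample buses are hypothetical. One caveat it shows: because the pattern requires a separate '-' on each side of both IDs, two stations that are directly adjacent in the path share a dash and are not matched. Dropping the leading dash of the second ID works around that, assuming every station ID is zero-padded to the same width.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE buses (bus_id INTEGER PRIMARY KEY, path TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO buses VALUES (?, ?)",
    [
        (1, "-01-03-09-17-28-"),  # the path from the answer
        (2, "-01-03-28-"),        # hypothetical: 03 and 28 are adjacent
        (3, "-28-17-03-"),        # hypothetical: visits 28 before 03
    ],
)


def find_buses(pattern):
    """Return the ids of buses whose path matches the LIKE pattern."""
    rows = conn.execute(
        "SELECT bus_id FROM buses WHERE path LIKE ? ORDER BY bus_id", (pattern,)
    )
    return [row[0] for row in rows]


# Pattern exactly as in the answer: misses bus 2 (adjacent stops share a dash).
original = find_buses("%-03-%-28-%")
# Variant without the second leading dash: relies on fixed-width station ids.
relaxed = find_buses("%-03-%28-%")
print(original, relaxed)
```

Bus 3 is excluded by both patterns, which is the intended behaviour: it passes 28 before 03.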

What you have thought of is good, although whether it is efficient depends on the case. I think you should create tables such as table1(BusID, BusName) and table2(StationList, BusID). I think this would help. Use joins between these two tables to get the result. One more thing: if possible, normalize the tables; that would help you.

I'd go for 3 tables :
bus
stations
bus_stations
"bus" for what the name stands for, "stations" for the station IDs and names, and "bus_stations" to connect those other two tables, which would have bus_id, station_id_from and station_id_to.
This is probably more complex than you really need, but it will be useful if, in the future, you need to know the full trajectory of a bus, or which station a bus comes from when it goes to "B station".
60 buses will not make that much impact in performance though.

Related

GTFS Data format: Non Unique Route id?

This is a query about the requirements for the routes file in a GTFS feed. If I understand correctly, a route is a set of trips spread out across a time horizon. For example, if there is a bus travelling between stations A and B five times a day, these trips would be allotted one route ID.
Now suppose there are two other stations, let's say C and D, distinct from A and B and not lying between A and B. Suppose these stations also have 5 trips running between them every day.
If a GTFS feed assigns these two sets of trips the same route ID, would this be a violation of the GTFS requirements?
One example of such a feed can be found here: https://gtfs.de/de/feeds/de_rv/
One example is the route with route ID 22. This ID is used for trips between stations that lie in two non-adjacent states (Nordrhein-Westfalen and Baden-Württemberg). The stations have no overlap.
Would this be violation of the GTFS specification?
Just a few days ago I had the same problem. Turns out that the creators of the feed put together routes with the same real world names (route_long_name).
In your case (route_id 22) it would be the S6, which is probably also used in different cities.
I don't really understand the logic behind this, but according to them it is still a valid feed.

SQL (MySQLi) best fit for text comparison

I have the following scenario:
A DB table of addresses linked with a region ID. Based on the address, the workers sort the packets (by QR scanning) onto shelves and redistribute them to warehouses all around the capital city. So far so good, everything seems OK, but there is a problem:
My DB table (MySQL) has the following fields:
ID (auto increment, PK)
STREET_NAT (local name of the street, Cyrillic) UTF8
STREET_EN (English name of the street, Latin) UTF8
REGION_ID (number from 1 to 116 that describes which part of the town (warehouse) the package will be distributed to)
The problem is that the addresses are sometimes written incorrectly and, as a bonus, they are sometimes in Cyrillic and sometimes in Latin.
I need to create a sorting system that finds the best fit for the street address and decides which part of the city the package will travel to. But people make mistakes: for example, they enter "Jul Vern st." instead of "Jules Verne str.", or write it in Cyrillic with mistakes.
So my question: is there some procedure or method in MySQL to guess the best fit for the address? I am thinking of a point system based on
php:
$query = "
SELECT REGION_ID FROM ADDRESSES WHERE STREET_NAT LIKE '%{$scanned_address}%'
OR STREET_EN LIKE '%{$scanned_address}%' ";
This approach works in approximately 55% of cases, namely when the sender of the package does not make a mistake. I need to improve this SELECT by adding something like "points" for how close the scanned address is to the database field value. The best fit wins, and the region ID is shown so the packet can be sorted to the corresponding shelf. I am talking about thousands of packets per day.
Any ideas?
Thanks
I have an idea: create a table with the correct street names in it (in Israel we have this data for free). Then you can compare what the user types against the correct street names in your database table, so it works like autocomplete.
You don't need to insert the values yourself. Just find a data source that can provide this data, and it will serve as autocomplete for the user.
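For the "points" part of the question, one possible sketch is to score the scanned text against every known street name in application code and take the best one. The example below uses Python's standard difflib.SequenceMatcher as the similarity measure; that choice, the 0.6 cut-off, and the sample streets and region IDs are all my own assumptions for illustration, not something MySQL provides out of the box.

```python
from difflib import SequenceMatcher

# Hypothetical rows: (STREET_NAT, STREET_EN, REGION_ID).
STREETS = [
    ("Жул Верн", "Jules Verne str.", 12),
    ("Виктор Юго", "Victor Hugo str.", 47),
    ("Юлий Цезарь", "Julius Caesar blvd.", 3),
]


def normalize(text):
    """Lowercase, drop dots and collapse whitespace before comparing."""
    return " ".join(text.lower().replace(".", " ").split())


def best_region(scanned, streets=STREETS, min_score=0.6):
    """Return (region_id, score) for the closest street, or (None, score)
    when nothing reaches min_score. Both name columns are tried, so the
    input may be in either alphabet."""
    query = normalize(scanned)
    best_score, best_id = 0.0, None
    for street_nat, street_en, region_id in streets:
        score = max(
            SequenceMatcher(None, query, normalize(street_nat)).ratio(),
            SequenceMatcher(None, query, normalize(street_en)).ratio(),
        )
        if score > best_score:
            best_score, best_id = score, region_id
    if best_score < min_score:
        return None, best_score
    return best_id, best_score


print(best_region("Jul Vern st."))
```

Packets whose best score falls under the cut-off come back as None and could be routed to manual sorting instead of a guessed shelf.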

Classpass.com like database design

I am trying to get my head around creating a ClassPass-like database design. I'm new to database design, and there are a few things that I am not quite sure how to implement.
You can check the classpass example:
https://classpass.com/classes
https://classpass.com/studios
EDIT 1: So here is the idea: each city has multiple neighbourhoods, each having multiple studios/venues.
After reading spencer7593's comment, here is what I came up with and the things that are still not quite clear:
So what I am not quite sure about is:
I am not sure how to store the venue/studio address and geolocation. Is it better to have a table Region which defines id | name | parent_id and stores the cities and the neighborhoods recursively? Or add foreign key constraints to city and neighborhood tables? Should I store the lat/lon in the venue table, in the address, or even in a separate locations table? I would like to be able to perform searches like:
show me venues in that neighborhood or city
show me venues which are in radius XX from position
Each class should have a schedule and currently I am not sure how to design it. For example: Spinning class, Mo, We, Fr from 9 AM till 10 AM. I would like to be able to do queries like:
show me venues, which have spinning classes on Mo
or show me all classes in category Spinning, Boxing for example
or even show me venues offering spinning classes
Should I create an extra table schedules here? Or just create some kind of view which creates the schedule? If it's an extra table, how should I describe start, end of each day of the week?
@Dimitar,
Even though @rhavendc is correct that this question should be placed on Database Administrators, I will answer your questions in order to the best of my knowledge.
I am not sure how to store the venue/studio address and geolocation. [...]
You can easily find geolocations by searching on the web. Take MyGeoPosition, for example.
I would like to be able to perform searches like
show me venues in that neighborhood or city.
You can do this easily. There are a few ways to do it, and each way will require a bit of tweaking with your ERD design. With the example I attached below, you can run a query to list all the venues with the address_id followed by the city id. The yellow entities are the one I added to ensure integrity.
For example:
-- venue.name is using the "[table].[field]" format to help
-- the engine recognize where the field is coming from.
-- This is useful if you are pulling the fields of the
-- same name from different tables.
select venue.name, city.name
from venue
join address using (address_id)
join city using (city_id);
NOTE: You don't have to include city.name. I just threw it in there so you can see the city each venue matches.
If you would like to do it by neighborhood, you would have to tweak the ERD I gave you by adding neighbor_id to the ADDRESS table. I have attached the example below. You would also have to add neighborhood_id. From there, you can run a query like this:
Using this ERD:
-- Remember the format from the previously mentioned code.
select venue.name, neighborhood.name
from venue
join address using (address_id)
join neighborhood using (neighbor_id);
show me venues which are in radius XX from position
You can calculate the amount of miles, kilometers, etc. from longitude and latitude using Haversine's Formula.
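As a rough illustration of the haversine formula, here is a small Python sketch. The function names, the sample coordinates and the use of a mean Earth radius of 6371 km are assumptions of mine; the same arithmetic can be written directly in a SQL expression if you prefer to filter in the database.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def venues_within(venues, lat, lon, radius_km):
    """Filter (name, lat, lon) tuples to those within radius_km of the position."""
    return [
        name
        for name, vlat, vlon in venues
        if haversine_km(lat, lon, vlat, vlon) <= radius_km
    ]


# Hypothetical venues, for illustration only.
venues = [("Studio Near", 0.0, 0.01), ("Studio Far", 0.0, 1.0)]
print(venues_within(venues, 0.0, 0.0, 5.0))
```

For larger tables it usually helps to pre-filter with a simple latitude/longitude bounding box, which can use an index, and apply the exact distance only to the rows that survive.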
Each class should have a schedule and currently I am not sure how to design it. For example: Spinning class, Mo, We, Fr from 9 AM till 10 AM. I would like to be able to do queries like:
show me venues, which have spinning classes on Mo
or show me all classes in category Spinning, Boxing for example
or even show me venues offering spinning classes
This can be easily derived from either of the ERDs I attached here. In the CLASS table, I added a field called parent_class_id which gets the class_id from the same table. This uses recursion, and I know this is a bit of a headache to understand. This recursion will allow the classes with assigned parent class to show that the classes are also offered at different times.
You can get this result by doing so:
-- Remember the format from the previously mentioned code.
select class1.name, class1.class_id, class2.class_id
from class as class1,
class as class2
where class1.parent_class_id = class2.class_id;
or even show me venues offering spinning classes
This may be a tricky one... If you are wondering which venues are offering spinning classes, where spinning is either part of or the name of the class, not a category, it's simple.
Try this...
-- Remember the format from the previously mentioned code.
select venue_id
from venue join
class using (venue_id)
where class_name = 'spinning';
NOTE: Keep in mind that string comparisons can be case-sensitive, depending on the database and the column's collation (MySQL's default collations are case-insensitive). To be safe, you could use where UPPER(class_name) = 'SPINNING'.
If the class name may include words other than "spinning", use this instead: where UPPER(class_name) like '%SPINNING%'.
If you are wondering which classes are offering spinning classes where spinning is a category, that's where the tricky bit comes in. I believe you would have to use a subquery for this.
Try this:
-- Remember the format from the previously mentioned code.
select class_id
from class join
class_category using (class_id)
where cat_id = (select cat_id
from category
where name = 'spinning');
Again, string comparisons may be case-sensitive depending on the engine and collation, so make sure the literal uses the correct upper or lower case.
Should I create an extra table schedules here? Or just create some kind of view which creates the schedule? If it's an extra table, how should I describe start, end of each day of the week?
Yes and no. You could, but if you can understand recursion in database systems, you don't have to.
Hope this helps. :)
Entity Relationship Modeling.
An entity is a person, place, thing, concept or event that can be uniquely identified, is important to the business, and we can store information about.
Based on information in the question, some candidates to consider as entities might be:
studio
class
rating
neighborhood
city
For each entity, what uniquely identifies it? Figure out the candidate keys.
And figure out the relationships between the entities, and the cardinalities. (What is related to what, and how many, required or optional?)
Is a studio related to a class?
Can a studio have more than one class?
Can a studio have zero classes?
Can a class be related to more than one studio?
Is a neighborhood related to zero, one or more cities?
Can a studio be related to more than one neighborhood?
Once you've got the entities and relationships, getting the attributes assigned to each entity is pretty straightforward. Just make sure every attribute is dependent on the key, the whole key, and nothing but the key.
FIRST
Your question is not really suited to Stack Overflow; I guess it would be better posted on Database Administrators.
SECOND
Here is some reading, just to give you a good start for building your database:
Data Modeling (It's kinda broad but it's for the better)
Logical Data Model (Short but comprehensive one)
THIRD
Basically, when designing your database you should first know all the data that would be needed in your system and group them (if needed) to make it small. Normalize it to reduce data redundancy.
EXAMPLE
Let's assume that the venue table is your main table, the center of all the transactions in your system. A venue may have subdata, for example branch, which holds the different branch locations, and a branch may have subdata too, for example schedule, teacher and/or class, which may also be related to each other (subdata gets data from other subdata), and so on for the dependent tables.
You can also create independent tables that still have connections to the others. For example, the neighborhood table may contain the neighborhood location and the main venue location (so it should hold the id of the selected venue from the venue table), and so on for the related but independent tables.
NOTE
Just remember the "one-to-one, one-to-many" relationships. If a piece of data will hold many kinds of subdata, split them into different tables. If it will hold only one kind of subdata, put it all in one table.

Data matching/ deduplication Sql server 2008 R2

What are the options for building a data cleansing process (deduplication/matching) when dealing with MS SQL Server 2008 R2?
Or, better yet, how can I weight scores across the columns of a row in a matching process?
The situation is the following: I have a persons table in my database, with the associated addresses and documents in other tables.
How can I make the best match decision based on name, document serial number and address? As I understand it, SSIS Fuzzy Grouping won't support this feature: weighted scoring.
I do not have much experience with SSIS at the moment - so this answer is focused on the de-duping/matching/scoring aspect of your question.
There are many ways to approach a Data Quality strategy such as this, all of which have pros and cons, and I think a lot of it comes down to your existing data management strategies: how clean and standardised is the data you are trying to dedupe?
Even 'simple' items like telephone numbers can be difficult to dedupe if you have not got this correct - for example all of these are different representations of the same number:
+1 (888) 707-8822
1-888-707-8822
18887078822
001 888 7078822
888-7078822
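To show what standardising before matching can look like, here is a minimal Python sketch that reduces the five representations above to one key. The rules (strip punctuation, drop a leading "00" international prefix, drop the "1" country code from 11-digit numbers) are my own simplification and only cover North American numbers like the example.

```python
import re


def normalize_us_phone(raw):
    """Reduce a North American number to its 10 national digits.

    Simplified, assumed rules: keep digits only, drop a leading '00'
    international prefix, then drop a leading '1' country code.
    """
    digits = re.sub(r"\D", "", raw)
    if digits.startswith("00"):
        digits = digits[2:]
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits


samples = [
    "+1 (888) 707-8822",
    "1-888-707-8822",
    "18887078822",
    "001 888 7078822",
    "888-7078822",
]
print({normalize_us_phone(s) for s in samples})
```

Storing the normalised value in its own column means duplicates can then be found with a plain equality comparison.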
The more complex structures such as addresses get even more interesting: are 'flat 2' and 'apartment 2' the same thing or different?
You have two choices: make it yourself or trust a third party.
Make it yourself
Advantages
Lots of fun logical problems to work through
Will be able to tweak and improve at will 'forever' as your solution grows
Disadvantages
It will take a lot of time.
Each country you use will need looking at separately - there are no high quality 'global' rules that you can apply (but there are of course snippets that can be reused)
Third Party
Advantages
If de-duplication is not your specialty - let the experts do it
Ready to go and deliver value immediately
Disadvantages
Cost
Whether you go your own route or third party I suggest you start by creating a clear goal.
What are your inputs:
How 'clean' is your data?
How standardised is your data?
How do the records link together.
Are the address records just from one country or are they from several.
What are your workflows:
How often do you need to run this process?
Do you want to stop duplicates entering your system in the first place or just run periodic bulk runs?
What do you want from the project?
To what level (document, person, household, organisation - see below) do you want to identify duplicates
What do you want to do with those duplicates
Delete duplicates and keep one record
Merge duplicates to create one master record
This stage is sometimes referred to as creating the 'Golden' record: deciding which information to keep, and which information to disregard.
To go into a bit more detail about some of those choices, consider the following dummy addresses:
Are you trying to dedupe to household level:
Ann Smith, 1 main st, DupeVille, MA, 12345
Bob Smith, 1 main street, DupeVille, MA, 12345
become
Ann and Bob Smith, 1 Main St, DupeVille, MA, 12345-6789
Person Level
Robert Smith, 1 main st, DupeVille, MA, 12345
Bob Smith, 1 main street, DupeVille, MA, 12345
become
Robert Smith, 1 Main St, DupeVille, MA, 12345-6789
or even by the ID's in your document database.
Once you have that plan, it may help you make up your mind about the best route to take. If you want to create it yourself, the links you have found certainly put you in the right mindset. If you want to go third party - there are a good range of suppliers out there. Just make sure you choose someone you can trust - they're going to be changing your data!
Google around for the various suppliers - Experian Data Quality are one of them (my company!) and depending upon where in the world you are, you can find your best contact details and more info here: http://www.qas.com/contact/office-locations.htm . We have tools that can integrate with SQL Server 2008 R2 which can score differing input types and then automatically dedupe these for you, or return the clusters of potential duplicates for you to look after yourself.
Take your plan, and clear idea of what you need from them and discuss it with them. Whoever you choose will be able to talk you through your plan, discuss your goals and tell you if they are the right people for the job.
Think I went on a bit there :-) but hopefully that points you in the right direction - Good luck!
If you do fuzzy grouping with multiple columns you will get _similarity information for every column you choose as input. With this similarity information you can calculate your own thresholds etc.
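Once you have one similarity value per column, combining them into a weighted score is simple arithmetic. The sketch below illustrates that step in Python; it computes the per-column similarities with difflib.SequenceMatcher purely as a stand-in for the values SSIS would give you, and the weights, column names and sample records are hypothetical.

```python
from difflib import SequenceMatcher

# Assumed weights: how much each column contributes to the overall score.
WEIGHTS = {"name": 0.5, "serial_no": 0.3, "address": 0.2}


def similarity(a, b):
    """Per-column similarity in [0, 1] (stand-in for the tool's own value)."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def weighted_score(rec_a, rec_b, weights=WEIGHTS):
    """Weighted average of the per-column similarities."""
    total = sum(weights.values())
    return sum(
        weight * similarity(rec_a[col], rec_b[col])
        for col, weight in weights.items()
    ) / total


# Hypothetical records, for illustration only.
a = {"name": "Robert Smith", "serial_no": "AB123456", "address": "1 main st"}
b = {"name": "Bob Smith", "serial_no": "AB123456", "address": "1 main street"}
c = {"name": "Ann Jones", "serial_no": "ZZ999999", "address": "22 oak ave"}

print(weighted_score(a, b), weighted_score(a, c))
```

Pairs scoring above a threshold you choose can be treated as matches, with a middle band sent for manual review.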

How to store a lot of different timetables in MYSQL?

I need help with modeling a database. I need to store the timetable for each public transport line. Let's see what we have...
I have different lines (bus number 100, 101, 102 and so on).
Each line has different stops and I need to store the coordinates of each one of them.
Each stop has a specific timetable, for example:
http://rozklady.mpk.krakow.pl/aktualne/0106/0106t001.htm
http://rozklady.mpk.krakow.pl/aktualne/0106/0106t003.htm
The aim of the program that I'm developing is to check for errors in the official timetables. Each bus has a tracking GPS device that sends its position to a database every 10 seconds. So I must check the hour of the reports whose coordinates are close to the coordinates of one of the stops and compare that time with the official time, and in case there is a big difference, create a row in other table STATISTICS reporting the issue.
Anyway, this was just for the context. The truth is that I don't have any clue about how to store it in an efficient way.
I thought about creating a table with the Stops: STOP_ID (PK) - NAME - LAT - LON - LINE - TIMETABLE
Where timetable would be an array containing all the times serialized for that stop [5:03,5:25,5:50,6:12,...].
Although I think this is not a good solution, I can't think about a better approach.
Maybe I could create a table for the stops, and another for timetables, but what would be the columns for timetables? I have so many variables... whether it's a weekday, Saturday or holiday, a lot of hours, minutes... and all different for each stop.
Could you share any thoughts about how to face this problem? Thank you very much!!
As Simon mentioned, you are starting a big project.
Suggestion: Read up on the various normal forms for relational DBMSs; this will give you some helpful background if you don't have it.
What are your entities (tables)?
Bus lines (consider the outbound trip and return trip to be two different lines).
Stations on those lines, ordered.
Trips (e.g. 106 bus leaves central station at 05:22, another trip at 05:42, etc).
Scheduled-stops
GPS observations.
Here are possible tables and columns:
Busline table: one row for each busline.
    Busline e.g. 106-outbound or 108-inbound (pk)
    Description
Station table: one row for each bus stop, including ends of trips
    Busline part of pk, fk to Busline e.g. 106
    Stationid part of pk fk to Station
    Description e.g. Second Avenue Eastbound at Houston Street
    lat
    long
Trip table: One row for each bus trip.
    Tripid pk
    Busline fk to Busline
    Description e.g. 05:22 trip Central Station to University Park
Schedule table: one row for each scheduled time for each trip at each stop
    Scheduleid pk ... ascending serial number.
    Busline fk to Station
    Stationid fk to Station
    Tripid fk to Trip
    Time
Observation table: a row for each of your GPS readings
    Observationid pk ... ascending serial number
    Busline if you know it fk to Busline
    Tripid if you have it fk to Trip
    Time
    Lat
    Long
My advice with RDBMS design is to avoid serializing multiple items of data into single DBMS columns. That's why I have suggested the Schedule table.
Once you figure out how to load your Busline, Station, Trip, and Schedule tables, and you've loaded your observations into the Observation table, it will be an interesting exercise to correlate your observations with your schedules.
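As a starting point for that correlation, here is a small sketch of the Schedule table with one row per scheduled time, plus a lookup that compares an observed time with the nearest scheduled one. It uses Python's sqlite3 module so it runs standalone. The day_type column, storing times as minutes after midnight, and matching to the nearest scheduled time (rather than by trip) are simplifying assumptions of mine; the sample times are the ones from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE schedule (
    schedule_id INTEGER PRIMARY KEY,
    busline     TEXT    NOT NULL,
    station_id  INTEGER NOT NULL,
    trip_id     INTEGER NOT NULL,
    day_type    TEXT    NOT NULL,  -- assumed: 'weekday', 'saturday' or 'holiday'
    time_min    INTEGER NOT NULL   -- minutes after midnight, e.g. 05:03 -> 303
);
INSERT INTO schedule (busline, station_id, trip_id, day_type, time_min) VALUES
    ('106-outbound', 1, 1, 'weekday', 303),  -- 05:03
    ('106-outbound', 1, 2, 'weekday', 325),  -- 05:25
    ('106-outbound', 1, 3, 'weekday', 350);  -- 05:50
""")


def delay_minutes(busline, station_id, day_type, observed_min):
    """Minutes late (positive) or early (negative) relative to the nearest
    scheduled time at this stop, or None if the stop has no schedule."""
    row = conn.execute(
        """
        SELECT time_min FROM schedule
        WHERE busline = ? AND station_id = ? AND day_type = ?
        ORDER BY ABS(time_min - ?)
        LIMIT 1
        """,
        (busline, station_id, day_type, observed_min),
    ).fetchone()
    return None if row is None else observed_min - row[0]


# A GPS report near stop 1 at 05:29 on a weekday.
print(delay_minutes("106-outbound", 1, "weekday", 5 * 60 + 29))
```

A delay beyond whatever tolerance you choose would then become a row in the STATISTICS table mentioned in the question.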
Be careful! You may embarrass your municipal transport department! :-)