How to convert the following MySQL schema into CouchDB?

I am not sure how I can design the following problem in CouchDB.
I have a logger web app that keeps track of how many items are in a warehouse. To simplify the problem, we just need to know the total number of items currently in the warehouse and how long each item stays in the warehouse before it ships. Let's say the warehouse only has shoes, but each shoe has a different id and needs to be tracked by id.
MySQL schema looks like this:

id  name  date-in     date-out
1   shoe  08/0/2010   null
2   shoe  07/20/2010  08/01/2010

The output will be:

Number of shoes in warehouse: 1
Average time in warehouse: 14 days
Thanks

jhs' answer is great, but I just wanted to add something:
To use the built-in reduce function for the avg calculation (_stats in your case), you have to use two "separate" views. But if your map function is exactly the same, CouchDB will detect that and not generate a whole new index for that second view. This way you can have one map function feeding multiple reduce functions.
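As a rough sketch of what that looks like (the design document name, view names and field names here are assumptions, not from the question), two views can carry a byte-identical map and differ only in their reduce, e.g. one reduced with the built-in _sum and one with the built-in _stats:

{
  "_id": "_design/shoes",
  "views": {
    "days_sum": {
      "map": "function (doc) { if (doc.date_in && doc.date_out) { emit(null, (new Date(doc.date_out) - new Date(doc.date_in)) / 86400000); } }",
      "reduce": "_sum"
    },
    "days_stats": {
      "map": "function (doc) { if (doc.date_in && doc.date_out) { emit(null, (new Date(doc.date_out) - new Date(doc.date_in)) / 86400000); } }",
      "reduce": "_stats"
    }
  }
}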

If each shoe is a document with a date_in and a date_out, then your reduce function adds 1 if the date_out is null, and 0 (no change) if date_out is not null. That will give you the total count of shoes in the warehouse.
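A minimal sketch of that pair of functions (the field names date_in/date_out are assumptions; a shoe still in the warehouse has a null or missing date_out):

// Map: one row per shoe; the value is its date_out (null while still in the warehouse)
function (doc) {
  emit(doc._id, doc.date_out || null);
}

// Reduce: +1 for a null date_out, +0 otherwise; on rereduce the incoming
// values are already partial counts, so they are simply added together
function (keys, values, rereduce) {
  var total = 0;
  for (var i = 0; i < values.length; i++) {
    total += rereduce ? values[i] : (values[i] === null ? 1 : 0);
  }
  return total;
}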
To compute the average time, you know, for each shoe, the time it spent in the warehouse. Since reduce functions must be commutative and associative (they are re-run over partial results), you cannot simply accumulate a running average; you need a different approach. The easiest way is to reduce to a [sum, count] array, where sum accumulates the total time for all shoes and count is the number of shoes counted. Then the client simply divides sum / count to compute the final average.
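A sketch of that view, with the same field-name assumptions as above (how the dates are stored and parsed is also an assumption; ISO 8601 strings are the safe choice):

// Map: for each shipped shoe, emit the days it spent in the warehouse
function (doc) {
  if (doc.date_in && doc.date_out) {
    emit(doc._id, (new Date(doc.date_out) - new Date(doc.date_in)) / 86400000);
  }
}

// Reduce: build a [sum, count] pair; on rereduce the incoming values are
// already [sum, count] pairs, so they are merged component-wise
function (keys, values, rereduce) {
  var sum = 0, count = 0;
  for (var i = 0; i < values.length; i++) {
    if (rereduce) {
      sum += values[i][0];
      count += values[i][1];
    } else {
      sum += values[i];
      count += 1;
    }
  }
  return [sum, count];
}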
I think you could combine both of these into one big reduce if you want, perhaps building up a {"shoes in warehouse": 1, "average time in warehouse": [253, 15]} kind of object.
However, if you can accept two different views for this data, then there is a shortcut for the average. In the map, emit(null, time) where time is the time spent in the warehouse. In the reduce, set the entire reduce value to _stats (see Built-in reduce functions). The view output will be an object with the sum and count already computed.
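A sketch of that shortcut (same assumptions as above about the date fields):

// Map: emit the time in the warehouse under a single key;
// the view's reduce is simply the string "_stats"
function (doc) {
  if (doc.date_in && doc.date_out) {
    emit(null, (new Date(doc.date_out) - new Date(doc.date_in)) / 86400000);
  }
}

Querying the view with the reduce enabled (the default) returns an object with sum, count, min, max and sumsqr, so the average is sum / count on the client.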

Related

AnyLogic: How to create an objective function using the values of two datasets (for an optimization experiment)?

In my AnyLogic model I have a population of agents (4 terminals) where trucks arrive, are served and depart from. The terminals have two parameters (numberOfGates and servicetime) which influence the departures per hour of trucks leaving the terminals. Now I want to tune these two parameters so that the number of departures per hour is closest to reality (I know the actual departures per hour). I already have two datasets within each terminal agent: one with the number of departures per hour that I simulate, and one with the observedDepartures from the data.
I already compare these two datasets in plots for every terminal.
Now I want to create an optimization experiment to tune the numberOfGates and servicetime of the terminals so that the departures dataset is as close as possible to the observedDepartures dataset. Does anyone know the easiest way to create an objective function for this optimization experiment?
When I add a variable diff that is updated every hour by abs(departures - observedDepartures) and put root.diff in the optimization experiment, it gives me the error "eq(null) is not allowed. Use isNull() instead" in a line that reads the database for the observedDepartures (see last picture). It works when I run the simulation normally; it only gives this error when running the optimization experiment (I don't know why).
You can use the sum of the absolute differences for each replication. That is, create a variable that logs the |difference| for each hour; call it diff. Then, in the optimization experiment, minimize the sum of that variable. In fact, this is close to a typical regression model's objective; there, a more complex objective function is used, minimizing the sum of the squared differences.
A Calibration experiment already does (in a more mathematically correct way) what you are trying to do, using the in-built difference function to calculate the 'area between two curves' (which is what the optimisation is trying to minimise). You don't need to calculate differences or anything yourself. (There are two variants of the function to compare either two Data Sets (your case) or a Data Set and a Table Function (useful if your empirical data is not at the same time points as your synthetic simulated data).)
In your case it (the objective function) will need to be a sum of the differences between the empirical and simulated datasets for the 4 terminals (or possibly a weighted sum if the fit for some terminals is considered more important than for others).
So your objective is something like
difference(root.terminals(0).departures, root.terminals(0).observedDepartures)
+ difference(root.terminals(1).departures, root.terminals(1).observedDepartures)
+ difference(root.terminals(2).departures, root.terminals(2).observedDepartures)
+ difference(root.terminals(3).departures, root.terminals(3).observedDepartures)
(It would be better to calculate this for an arbitrary population of terminals in a function but this is the 'raw shape' of the code.)
A Calibration experiment is actually just a wizard which creates an Optimization experiment set up in a particular way (with a UI and all settings/code already created for you), so you can just use that objective in your existing Optimization experiment (but it won't have a built-in useful UI like a Calibration experiment). This also means you can still set this up in the Personal Learning Edition too (which doesn't have the Calibration experiment).

Would it be faster to make a Python script to aggregate a table, or would a built-in SQL aggregate combined with polling be faster?

Currently, I have a little problem where I'm expected to build a table that shows the energy generated for the respective days.
I have solved this problem using Python with SQL data polling, combined with a for loop that looks at the energy generated at the beginning of the day and at the end of the day; the difference between the two gives the total energy generated for that particular day. But unfortunately, due to the amount of data coming out of the SQL database, the Python function is too slow.
I was wondering if this can be integrated within an SQL query to just spit out a table after it has done the aggregation. I have shown an example below for a better understanding of the table.
SQL TABLE
date/time         value
24/01/2022 2:00   2001
24/01/2022 4:00   2094
24/01/2022 14:00  3024
24/01/2022 17:00  4056
25/01/2022 2:00   4056
25/01/2022 4:00   4392
25/01/2022 17:00  5219
Final Table
From the above table, we can work out that the energy generated for 24/01/2022 is 4056 (max) - 2001 (min) = 2055.
date        value
24/01/2022  2055
25/01/2022  1163
Usually, the time spent sending extra data across the network makes the app-side solution slower.
The GROUP BY may cost an extra sort, or it may be "free" if the data is sorted that way. (OK, you say unindexed.)
Show us the query and SHOW CREATE TABLE; we can help with indexing.
Generally, there is much less coding for the user if the work is done in SQL.
MySQL, in particular, picks between
Case 1: Sort the data, O(N*log N), then make a linear pass through the data; this may or may not involve I/O, which would add overhead.
Case 2: Build a lookup table in RAM for collecting the grouped info, then make a linear pass over the data (no index needed); but then you need something like O(N*log n) for counting/summing/whatever for the grouped values.
Notes:
I used N for the number of rows in the table and n for the number of rows in the output.
I do not know the conditions that would cause the Optimizer to pick one method versus the other.
If you drag all the data into the client, you would probably pick one of those algorithms. If you happen to know that you are grouping on a simple integer, the lookup (for the second algorithm) could be a simple array lookup -- O(N). But, as I say, the network cost is likely to kill the performance.
It is simple enough to write in SQL:
SELECT DATE(`date`) AS `day`,
       MAX(value) - MIN(value) AS `range`
FROM tbl
GROUP BY DATE(`date`);

SQL - How to find optimal performance numbers for query

First time here, so forgive me for any faux pas. I have a question about the limitations of SQL, as I am new to the language, and what I need is, I believe, rather complex.
Is it possible to automate finding the optimal data for a specific query? For example, say I have the following columns:
1) Vehicle type (Text) e.g. car,bike,bus
2) Number of passengers (Numeric) e.g. 0-7
3) Was in an accident (Boolean) e.g. t or f
From here, I would like to get percentages. So if I were to select only cars with 3 passengers, what percentage of the total accidents does that account for?
I understand how to get this as a one-off, or how to calculate it mathematically; however, my question relates to how to automate this process to get the optimum number.
So, keeping with this example, say I look at just cars, what number of passengers covers the highest percentage of accidents?
At the moment, I am going through and testing number by number; is there a way to 'find' the optimal number? It is easy when it is just 0-7, as in the example, but I would naturally like to deal with a larger range, and even multiple ranges. For example, say we add another variable titled:
4) Number of doors (numeric) e.g. 0-3
Would there be a way of finding the best combination of numbers from these two variables that cover the highest percentage of accidents?
So say we took: Car, >2 passengers, <3 doors on the vehicle. Out of the accidents variable, 50% were true.
But if we change that to: Car, >4 passengers, <3 doors, then out of the accidents variable, 80% were true.
I hope I have explained this well. I understand that this is most likely not possible with SQL; however, is there another way to find these optimum numbers?
Thanks in advance
Here's an example that will give you an answer for all possibilities. You could add a limit clause to show only the top answer, or add to the where clause to limit to specific terms.
SELECT
  `vehicle_type`,
  `num_passengers`,
  SUM(IF(`in_accident`, 1, 0)) AS `num_accidents`,
  COUNT(*) AS `num_in_group`,
  SUM(IF(`in_accident`, 1, 0)) / COUNT(*) AS `percent_accidents`
FROM `accidents`
GROUP BY `vehicle_type`, `num_passengers`
ORDER BY `percent_accidents` DESC;

Which of these would be safer/better to run?

I have 451 cities with coordinates. I want to calculate the distance between each pair of cities and then order some results by that distance. I have 2 options:
I can run a loop that calculates the distance for every possible combination of cities and stores them in a table, which would result in roughly 200k rows.
Or, I can skip the pre-calculation and, when results are displayed (about 30 per page), calculate the distance for each city separately.
I don't know which would be better for performance, but I would prefer option one, in which case I have another concern: is there a way I could end up with as few rows as possible? Currently, I count the possibilities as 451^2, but I think I could divide that by 2, since the distance for City1-City2 is the same as for City2-City1.
Thanks
If your table of cities is more or less static, then you should definitely pre-calculate all distances and store them in a separate table. In this case you will have about 451^2/2 rows (just make sure that the id of City1 is always lower than the id of City2, or the other way round, it doesn't really matter).
Normally the cost of a single MySQL query is quite high and the cost of the mathematical operations is really low. Especially if the scale of your map is small and the required precision is low, so that you can calculate with a fixed distance per degree, you will be faster calculating on the fly.
Furthermore, you would have a problem if the number of cities rises because of a change in your project, and therefore the number of combinations you'd have to store in the DB exceeds the limits.
So you'd probably be better off without pre-calculating.

Criteria for storing calculation results versus making calculations at run time

One of my clients is a golf site. Every couple of weeks they come up with a new statistic they want available in various reports on individual rounds and covering all rounds.
For instance, the percentage of putts made versus missed when attempted from under 10 feet.
Some parameters:
Report viewings occur more often than new record entries
Reports need to run as quickly as possible, and this is a higher priority than saving quickly
Every hole in a round has a record
Every round has a record
Every user has a lifetime stats record
We are storing over 250 individual datapoints per round (holes record included)
The stats pages display about 100 individual calculations
My current approach has been to add fields to the hole/round/lifetime stats tables as new stats are needed and to calculate stats every time a round is saved.
The problem is that at some point we may well exceed MySQL's maximum row size of 65535 bytes.
So, the questions are:
Is there a point where I should start calculating statistics on the fly instead of storing them?
Alternately should I just plan on adding new stats tables to hold the overflow?
If you have indexes on the date, hole, user and round fields, it should take very little time to calculate the latest stats.
SELECT s.*
     , 1 - perc_missed AS perc_hit
FROM (
    SELECT IFNULL(SUM(putts_missed) / SUM(total_putts), 1) AS perc_missed
         , player
         , hole
    FROM golf_stats gs
    WHERE gs.playdate BETWEEN '2011-01-01' AND '2011-02-01'
    GROUP BY gs.player, gs.hole
) AS s;