I have a hazard model based on a Weibull distribution, fitted with the Stata command streg with the options nohr and time appended to that line of code. At least, that's the code from the do-file I downloaded from a replication file.
Given a new slice of data, how do I compute the model's prediction for that specific slice? I would solve it by hand in Excel (my wheelhouse is R or Python), but the closed form of the regression eludes me. From the documentation on the command I can't tell exactly how the other regressors enter the model, and the Weibull regression has a lot of parameters that I'd rather not grind through manually. I'm hoping someone can help with what I believe is a simple out-of-sample forecast in a language I simply do not use.
infile warnum frstyear lastyear ccode1 ccode2 length logleng censor oadm oada oadp omdm omda omdp opdm opda opdp durscale rterrain rterrstr summperb sumpopbg popratbg bofadjbg qualratb salscale reprsumb demosumb surpdiff nactors adis3010 using 1perwarf.raw
stset length, id(warnum) fail(censor)
streg oadm oada oadp opda rterrain rterrstr bofadjbg summperb sumpopbg popratbg qualratb surpdiff salscale reprsumb demosumb adis3010 nactors, dist(weibull) nohr time
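For what it's worth, here is a minimal sketch of the closed form in Python. It assumes the accelerated failure-time parameterisation that Stata documents for dist(weibull) with the time option: S(t | x) = exp(-(exp(-xb) * t)^p), where xb is the linear predictor (the constant plus each coefficient times its covariate) and p = exp(/ln_p) from the streg output. The function names and every number below are placeholders for illustration, not estimates from this model; the real coefficients have to be copied from the streg results.

```python
import math

def linear_predictor(coefs, row, intercept):
    """x'b: the constant plus each coefficient times its covariate value."""
    return intercept + sum(coefs[name] * row[name] for name in coefs)

def weibull_aft_survival(t, xb, p):
    """S(t | x) = exp(-(exp(-xb) * t) ** p) under the assumed AFT metric."""
    return math.exp(-((math.exp(-xb) * t) ** p))

def weibull_aft_median(xb, p):
    """Time at which the survival function equals 0.5."""
    return math.exp(xb) * math.log(2.0) ** (1.0 / p)

def weibull_aft_mean(xb, p):
    """Expected duration: exp(xb) * Gamma(1 + 1/p)."""
    return math.exp(xb) * math.gamma(1.0 + 1.0 / p)

# Placeholder values only (hypothetical, NOT taken from the replication output).
coefs = {"oadm": 0.5, "nactors": -0.1}
intercept = 2.0
ln_p = -0.2

# One new observation; only two regressors are shown to keep the sketch short.
new_row = {"oadm": 1.0, "nactors": 3.0}

xb = linear_predictor(coefs, new_row, intercept)
p = math.exp(ln_p)
median = weibull_aft_median(xb, p)
mean = weibull_aft_mean(xb, p)
print(f"xb={xb:.3f}  p={p:.3f}  median={median:.3f}  mean={mean:.3f}")
```

With all seventeen regressors in the coefs dictionary, the same three formulas apply unchanged. If the replication file's numbers are available in Stata, comparing against predict with the median time option would be a sensible sanity check.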
I have a lot of GPX files of users' driving data from a game project in which objects are placed on the road for the user to collect. I want to analyze these data to find out how users tend to drive given different objects: which ones draw them the most and which ones the least. I have not done any data analysis before, so how can I analyze these data to get this sort of information? This might sound very novice, but any help is appreciated.
If you are a novice, you would probably like to do this in Python, where you can use a library such as gpxpy to explore your data.
It is a GPX parser, and I believe it will give you the data you want to see.
Their documentation shows that you can use it like this:
import gpxpy
import gpxpy.gpx
# Open a file
gpx_file = open('yourfile.gpx', 'r')
# Parse the file
gpx = gpxpy.parse(gpx_file)
# Iterate over the tracks
for track in gpx.tracks:
    for segment in track.segments:
        for pt in segment.points:
            print(f'Point at ({pt.latitude},{pt.longitude}) -> {pt.elevation}')
for waypoint in gpx.waypoints:
    print(f'waypoint {waypoint.name} -> ({waypoint.latitude},{waypoint.longitude})')
for route in gpx.routes:
    print('Route:')
    for pt in route.points:
        print(f'Point at ({pt.latitude},{pt.longitude}) -> {pt.elevation}')
Once you have those points you can calculate the distances, speeds, etc. from the coordinates.
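As a rough sketch of that last step, the snippet below computes the great-circle distance and average speed between consecutive points with the haversine formula. It uses only the standard library; the helper names are made up for illustration, and it assumes each point is a (latitude, longitude, datetime) tuple, which you could build from pt.latitude, pt.longitude and pt.time of the gpxpy points.

```python
import math
from datetime import datetime, timedelta

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude pairs."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def segment_speeds(points):
    """Return (distance_m, speed_m_per_s) for each consecutive pair of points.

    points: list of (latitude, longitude, datetime) tuples in travel order.
    Speed is None when two points share the same timestamp.
    """
    result = []
    for (lat1, lon1, t1), (lat2, lon2, t2) in zip(points, points[1:]):
        dist = haversine_m(lat1, lon1, lat2, lon2)
        seconds = (t2 - t1).total_seconds()
        speed = dist / seconds if seconds > 0 else None
        result.append((dist, speed))
    return result

# Tiny made-up track: three points, ten seconds apart.
start = datetime(2020, 1, 1, 12, 0, 0)
track = [
    (46.0000, 14.0000, start),
    (46.0005, 14.0000, start + timedelta(seconds=10)),
    (46.0010, 14.0005, start + timedelta(seconds=20)),
]
for dist, speed in segment_speeds(track):
    print(f"{dist:.1f} m at {speed:.1f} m/s")
```

Comparing these per-segment speeds near each object's position with the speeds elsewhere on the track would be one simple way to see which objects change how users drive.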
I have a term project that needs to use data stored in MySQL to train a classification model using TensorFlow or something similar.
I tried the examples from https://github.com/tensorflow/docs/blob/master/site/en/r2/tutorials/keras/feature_columns.ipynb, and it took me a lot of time to export the data to a CSV file and modify the Python script. Since I need to run a lot of experiments, is there a simpler tool for training and experimenting on my MySQL dataset?
Maybe SQLFlow can meet your needs. I tried to write an SQLFlow script for the dataset you provided; it should look like this:
SELECT *
FROM Heart_Disease
TRAIN DNNClassifier /* a pre-defined TensorFlow estimator, tf.estimator.DNNClassifier */
WITH n_classes = 3, hidden_units = [10, 20] /* parameters of the Estimator class constructor */
COLUMN Age, Sex, CP, FBS .. /* From the raw data, enter the columns that you think will help predict the target. */
LABEL Target /* label column */
INTO Heart_Disease.test_model; /* The trained model is saved to the specified data table */
It is also very easy to apply this model:
SELECT *
FROM Heart_Disease.predict
PREDICT Heart_Disease.predict_result.Target
USING Heart_Disease.test_model;
The Target column of Heart_Disease.predict is empty; the predicted Target is saved to Heart_Disease.predict_result.Target.
FYI: https://github.com/sql-machine-learning/sqlflow/blob/develop/doc/demo.md
This is my first answer. Hope I can help you.
What I think you can do is take a dump of the data from SQL, if it is not real-time and not being updated, and then use that dump for the rest,
or you can create a MySQL connection and pass it to pandas' read_sql function to get a DataFrame.
A way to do that
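A minimal sketch of the read_sql route follows. So that it runs anywhere, it uses an in-memory SQLite database as a stand-in; the table name heart and its three columns are made up for the example. For MySQL you would instead pass read_sql a SQLAlchemy engine, for example one created with create_engine("mysql+pymysql://user:password@host/dbname"), where the credentials are placeholders and the pymysql driver is an assumption about your setup.

```python
import sqlite3

import pandas as pd

def load_table(conn, query):
    """Run a SQL query and return the result as a pandas DataFrame."""
    return pd.read_sql(query, conn)

# Stand-in database with a tiny made-up table (hypothetical columns).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE heart (age INTEGER, sex INTEGER, target INTEGER)")
conn.executemany(
    "INSERT INTO heart VALUES (?, ?, ?)",
    [(63, 1, 1), (37, 1, 0), (41, 0, 1)],
)
conn.commit()

df = load_table(conn, "SELECT * FROM heart")

# Split into model inputs and the label column, ready for training.
features = df.drop(columns=["target"])
labels = df["target"]
print(features.shape, labels.tolist())
```

Because the DataFrame is built straight from the query, changing the experiment only means changing the SQL, with no CSV export step in between.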
Also, if you're new to TensorFlow, you should try TensorFlow's Estimator API, which should do the job. Apart from that, you may use TensorFlow's Keras wrapper, which also makes building a neural network easier.
I have image data saved in 0.csv files.
The format is as in the picture below.
How can I read it into TensorFlow?
Thanks!
You should use the Dataset input pipeline introduced in tensorflow 1.4:
https://www.tensorflow.org/programmers_guide/datasets#consuming_text_data
Here's the example from the developers guide (though you'll want to read through that guide, it's quite well written):
import tensorflow as tf

filenames = ["/var/data/file1.txt", "/var/data/file2.txt"]
dataset = tf.data.Dataset.from_tensor_slices(filenames)
# Use `Dataset.flat_map()` to transform each file as a separate nested dataset,
# and then concatenate their contents sequentially into a single "flat" dataset.
# * Skip the first line (header row).
# * Filter out lines beginning with "#" (comments).
# Note: `tf.substr` is the TensorFlow 1.x name; TensorFlow 2.x calls it `tf.strings.substr`.
dataset = dataset.flat_map(
    lambda filename: (
        tf.data.TextLineDataset(filename)
        .skip(1)
        .filter(lambda line: tf.not_equal(tf.substr(line, 0, 1), "#"))))
The Dataset preprocessing pipeline has a few nice advantages. Most of the functionality you'll need, such as reading text records, shuffling, and batching, is reduced to one-liners. More importantly, though, it forces you into writing your preprocessing pipeline in a good, modular, testable way. It takes a little while to get used to the API, but it's time well spent.
Let's say I have a text file that looks like this:
<number> <name> <type> <inputs...>
1 XOR1 XOR A B
2 SUM XOR 1 C
What would be the best approach to generate the truth table for this circuit?
That depends on what you have available, and how big your file is.
Perl is optimized for reading files and generating simple text output. It doesn't have a library of boolean operators, but they're easy enough to write. I'd use that if I just wanted text-in, text-out.
If I wanted to display the data online AND generate a results file, I'd use PHP to read the data and write the table to a CSV file that could either be opened in Excel, or posted online in an HTML table.
If your data is in a REALLY BIG data file, I'd use SQL.
If your data is in a really huge file that you want to be accessible to authorized users online, and you want THEM to be able to create truth tables, I'd use Oracle's APEX to create an easy interface for them to build their own truth tables and play around with the data without altering it.
If you're in an electrical engineering environment, use the tools designed for your problem -- Verilog or similar.
Whatcha got? Whatcha wanna do with it?
-- Ada
I prefer using C#. I already have the code to 'parse' the input text file. I just don't know where to start in terms of actually 'simulating' it. The output can simply be a text file with inputs and output values – Don
How many inputs and how many outputs in the circuit you want to simulate?
The size of the simulation determines how it can most easily be run. If the circuit is small(ish), you can enter the inputs and circuit values into vector arrays, then cross them to get the output matrix.
Matlab is ideal for this, as it was written for processing arrays.
Again: Whatcha got, and whatcha wanna do with it?
-- Ada
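For a small circuit like the one in the question, the enumerate-and-evaluate idea can be sketched as below. This is an illustration in Python rather than C#, under a few assumptions that are not in the question: gates are listed so that every gate appears after the gates it depends on, an input token that matches a gate number refers to that gate's output, any other token is a primary input, and only the gate types in the GATES table are supported.

```python
from itertools import product

# Supported gate types (an assumption; extend as needed).
GATES = {
    "AND": lambda vals: all(vals),
    "OR": lambda vals: any(vals),
    "XOR": lambda vals: sum(vals) % 2 == 1,
    "NOT": lambda vals: not vals[0],
}

def parse_netlist(text):
    """Parse lines of the form '<number> <name> <type> <inputs...>'."""
    gates = []
    for line in text.strip().splitlines():
        number, name, gate_type, *inputs = line.split()
        gates.append((number, name, gate_type, inputs))
    return gates

def truth_table(gates):
    """Return (header, rows) covering every combination of primary inputs."""
    gate_ids = {number for number, _, _, _ in gates}
    # Any input token that is not a gate number is treated as a primary input.
    primary = sorted(
        {src for _, _, _, inputs in gates for src in inputs if src not in gate_ids}
    )
    rows = []
    for bits in product([0, 1], repeat=len(primary)):
        values = dict(zip(primary, bits))
        # Evaluate gates in file order; each result is stored under its number.
        for number, _, gate_type, inputs in gates:
            values[number] = int(GATES[gate_type]([values[s] for s in inputs]))
        rows.append(bits + tuple(values[number] for number, _, _, _ in gates))
    header = primary + [name for _, name, _, _ in gates]
    return header, rows

netlist = """
1 XOR1 XOR A B
2 SUM XOR 1 C
"""
header, rows = truth_table(parse_netlist(netlist))
print(" ".join(header))
for row in rows:
    print(" ".join(str(bit) for bit in row))
```

The table grows as 2^n in the number of primary inputs, so this brute-force approach only suits small circuits; the same structure (a dictionary of values plus a loop over gates in order) carries over to C# directly.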
I am storing a series of events to a CSV file, each event type comes with a different set of data.
To illustrate, say I have two events (there will be many more):
Running, which has a data set containing speed and incline.
Sleeping, which has a data set containing snores.
There are two options to store this data in CSV records:
Option A
Storing each possible item of data in its own field...
speed, incline, snores
therefore...
15mph, 20%,
, , 12
16mph, 20%,
14mph, 20%,
Option B
Storing each event in its own record...
event, value1...
therefore...
running, 15mph, 20%
sleeping, 12
running, 16mph, 20%
running, 14mph, 20%
Without a specific CSV specification, the consensus seems to be:
Each record "should" contain the same number of comma-separated fields.
Context
There are a number of events which each have a large & different set of data values.
CSV data is to be of use to other developers (I will/could/should/won't use either structure).
The 'other developers' are expected to be toward the novice end of the spectrum and/or using resource-limited systems. CSV is accessible.
The CSV format is being provided non-exclusively, as a feature and not a requirement. However, if said application is providing a CSV file, it should be provided in the correct manner from now on.
Question
Would it be valid – in this case - to go with Option B?
Thoughts
Option B maintains a level of human readability, which is an advantage if the CSV is read by a human rather than a program. Neither method is more complex to parse with a custom parser, but will Option B void the usefulness of the CSV format with other libraries, frameworks, applications et al.? With Option A, future changes to the data set of an individual event may break the CSV structure (zombie , , fields kept to maintain forwards compatibility), whereas Option B will fail gracefully.
edit
This may be aimed at students and frameworks like OpenFrameworks, Plask, Processing et al. where CSV is easier to implement.
Any "other frameworks, libraries and applications" I've ever used all handle CSV parsing differently, so trying to conform to one or many of these standards might over-complicate your end result. My recommendation would be to keep it simple and use what works for your specific task. If human readability is a requirement, then CSV in the form of Option B would work fine. Otherwise, you may want to consider JSON or XML.
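As a rough illustration that Option B stays easy to read with an off-the-shelf parser, here is a sketch using Python's standard csv module, which accepts rows of differing lengths. The EVENT_FIELDS mapping is an assumption built from the question's two example events; a real application would list its own event types there.

```python
import csv
import io

# Field names per event type (assumed from the question's example).
EVENT_FIELDS = {
    "running": ["speed", "incline"],
    "sleeping": ["snores"],
}

def parse_events(text):
    """Turn Option B records into a list of dictionaries, one per event."""
    events = []
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not row:
            continue  # ignore blank lines
        event, *values = row
        fields = EVENT_FIELDS.get(event)
        if fields is None:
            continue  # unknown event type: skip it rather than fail
        events.append({"event": event, **dict(zip(fields, values))})
    return events

sample = """running, 15mph, 20%
sleeping, 12
running, 16mph, 20%
running, 14mph, 20%
"""
for record in parse_events(sample):
    print(record)
```

Skipping unknown event types is what lets older readers keep working when new events are added, which is the graceful-failure property the question attributes to Option B.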
As you say, there is no "CSV Standard" with regard to contents. The real answer depends on what you are doing and why. You mention "other frameworks, libraries and applications". The one thing I've learnt is "Don't over-engineer", i.e. don't write reams of code today on the assumption that you will plug it into some other framework tomorrow.
I'd say option B is fine, unless you have specific requirements to use other apps etc.
< edit >
Having re-read your context, I'd probably pick one output format and use it, and forget about having multiple formats:
Having multiple output formats is a source of inconsistency (e.g. bug in one format but not another).
Having multiple formats means more code that needs to be tested, documented, and supported.
< /edit >
Is there any reason you can't use XML? Yes, it's slightly more difficult to parse, at least for novices, but if so they probably need the practice. File size would be much greater, of course, but it's compressible.