I'm having a hard time finding a mixed-effects model that will fit my data.
My data has the following structure:
Participants have been divided into two disjoint groups
Each participant performs multiple trials
There are 3 types of trials
All participants perform the same trials
During each trial, we collect a metric every 20 ms.
We want to account for the differences between the groups and trial types.
So far, I've tried:
Target ~ Time*Trial*Group + (1 + Trial + Time | Participant) + (1 | Participant:Group:Trial)
but the model's negative log-likelihood (NLL) is very low (-230K).
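For reference, a rough Python/statsmodels analogue of that lme4-style formula (a sketch only: the file name trials.csv and the exact column names Target, Time, Trial, Group, Participant are assumptions, and the second crossed random term is left out):

```python
# Minimal sketch, not the original analysis: full Time x Trial x Group fixed effects
# with a per-participant random intercept plus Time and Trial random slopes.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("trials.csv")  # hypothetical file: one row per 20 ms sample

model = smf.mixedlm(
    "Target ~ Time * C(Trial) * C(Group)",  # fixed effects
    data=df,
    groups="Participant",                   # random effects grouped by participant
    re_formula="~ Time + C(Trial)",         # random intercept + Time/Trial slopes
)
result = model.fit()
print(result.summary())
print("Log-likelihood:", result.llf)
```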
Would appreciate any help,
Thank you
I'm currently working on a project involving a dataset that contains smartphone usage data from around 200 users over a period of 4 months. For each user, I have a dataframe consisting of app-log events (name of the app, time, location, etc.). My goal is to predict the dwell time of the next app a user is going to open. I don't want to build one model per user; instead, I'm trying to build a single model for all users combined. Now I'm struggling to find an architecture that is suitable for this project.
The records are not evenly spaced in time, and the length of each dataframe differs. I want to exploit the temporal dependencies while learning from multiple users at once, so my input would be multiple parallel sequences of app-usage durations with additional features, and my output would be multiple parallel sequences containing the dwell time of the next app. But since the sequences are neither evenly spaced in time nor of the same length, a standard sequence model doesn't seem to fit directly. I just wanted to get some ideas on how to structure the data properly and what you think would be a suitable approach. I would really appreciate some ideas or reading recommendations.
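For illustration of the input/output structure described above, here is a minimal sketch (PyTorch; the feature layout, sizes and names are invented): one shared recurrent model over all users, with each user's sequence padded to a common length, the true lengths passed alongside, and the uneven spacing fed in explicitly as a time-delta feature.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

class DwellTimeModel(nn.Module):
    def __init__(self, n_features, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)  # predicted dwell time of the next app

    def forward(self, x, lengths):
        # Packing lets the GRU skip the padded positions entirely.
        packed = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)
        out, _ = self.rnn(packed)
        out, _ = pad_packed_sequence(out, batch_first=True)
        return self.head(out).squeeze(-1)  # one prediction per event

# Per-event features could be e.g. [app embedding..., dwell time,
# seconds since previous event, hour of day, ...]; the time delta is what
# carries the uneven spacing.
model = DwellTimeModel(n_features=8)
x = torch.randn(4, 50, 8)                 # batch of 4 users, padded to 50 events each
lengths = torch.tensor([50, 32, 47, 12])  # true (unpadded) sequence lengths
pred = model(x, lengths)                  # shape: (4, 50)
```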
I am building a deep learning model for macro-economic prediction. However, different indicators vary widely in their sampling frequency, ranging from minutes to annual.
Dataframe example
The picture shows 'Treasury Rates (DGS1-20)', which is sampled daily, and 'Inflation Rate (CPALT...)', which is sampled monthly. These features are essential for training the model, and dropping the NaN rows would leave too little data.
I've read some books and articles about how to deal with missing data, including downsampling to a monthly time frame, replacing the NaNs with -1, filling them with the average of the last and next values, etc. But the methods I've read about mostly deal with datasets where roughly 10% of the values are missing, whereas in my case the monthly 'Inflation (CPI)' series is more than 90% missing once I combine it with the daily 'Treasury Rate' data.
I was wondering if there is any workaround for handling missing values, particularly for economic data where the sampling intervals differ so widely. Thank you
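One common workaround, sketched below with made-up file and column names, is to align the low-frequency series to the high-frequency index instead of dropping NaN rows, carrying the last published monthly value forward until a new one appears:

```python
import pandas as pd

# Both frames are assumed to be sorted by date.
daily = pd.read_csv("treasury_rates.csv", parse_dates=["date"], index_col="date")
monthly = pd.read_csv("cpi.csv", parse_dates=["date"], index_col="date")

# Forward-fill the monthly CPI onto the daily index; each daily row only ever
# sees the most recently released monthly value, so there is no look-ahead.
combined = daily.join(monthly.reindex(daily.index, method="ffill"))

# Optionally record how stale the monthly value is, so the model can use that too.
last_release = monthly.index.to_series().reindex(daily.index, method="ffill")
combined["days_since_cpi_release"] = (combined.index.to_series() - last_release).dt.days

print(combined.head())
```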
I'm a junior doctor and I'm creating a database system for my senior doctor.
Basically, my senior doctor wants to be able to store a whole lot of information on each of his patients in a relational database so that later, he can very easily and quickly analyse / audit the data (e.g. based on certain demographics, find which treatments result in better outcomes, or which ethnicities respond better to certain treatments, etc.).
The information he wants to store for each patient is huge.
Each patient is to complete 7 surveys (each takes only 1-2 minutes) a number of times (immediately before their operation, immediately postop, 3 months postop, 6 months postop, 2 years postop and 5 years postop) - the final score of each of these surveys at each of these times will be stored in the database.
Additionally, he wants to store their relevant details (name, ethnicity, gender, age etc etc).
Finally, he also intends to store A LOT of relevant past medical history, current symptoms, examination findings, the various treatment options they try and then outcome measures.
Basically, there's A LOT of info for each patient. And all of this info will be unique to each patient. Because of this, I've created one HUGE patient table (~400 columns) to contain all of it. The reason I've done this is that most of the columns in the table will NOT be redundant for each patient.
Additionally, this entire PHP / MySQL database system is only going to live locally on his computer; it'll never be on the internet.
The tables won't have too many patients, maybe around 200 - 300 by the end of the year.
Given all of this, is it ok to have such a huge table?
Or should I be splitting it into smaller tables
i.e.
- Patient demographics
- Survey results
- Symptoms
- Treatments
etc. etc, with a unique "patient_id" being the link between each of these tables?
What would be the difference between the 2 approaches and which would be better? Why?
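For concreteness, the split version might look roughly like the sketch below (SQLite is used purely for illustration and the column names are invented; the same DDL carries over to MySQL almost unchanged):

```python
import sqlite3

conn = sqlite3.connect("audit.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS patient (
    patient_id    INTEGER PRIMARY KEY,
    name          TEXT,
    ethnicity     TEXT,
    gender        TEXT,
    date_of_birth TEXT
);

CREATE TABLE IF NOT EXISTS survey_result (
    id          INTEGER PRIMARY KEY,
    patient_id  INTEGER REFERENCES patient(patient_id),
    survey_name TEXT,   -- which of the 7 surveys
    timepoint   TEXT,   -- 'preop', 'postop', '3m', '6m', '2y', '5y'
    score       REAL
);

CREATE TABLE IF NOT EXISTS treatment (
    id         INTEGER PRIMARY KEY,
    patient_id INTEGER REFERENCES patient(patient_id),
    treatment  TEXT,
    outcome    TEXT
);
""")
conn.commit()
```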
About the 400 columns...
Which, if any, of the columns will be important to search or sort on? Probably very few of them. Keep those few columns as columns.
What will you do with the rest? Probably you simply display them somewhere, using some app code to pretty-print them? So these may as well be in a big JSON string.
This avoids the EAV (entity-attribute-value) nightmares, yet stores the data in the database in a format that is actually reasonably easy (and fast) to use.
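A sketch of that "few real columns plus one JSON document" layout (SQLite and invented column names, for illustration only; MySQL 5.7+ has a native JSON type and JSON_EXTRACT if the blob ever needs ad-hoc querying):

```python
import json
import sqlite3

conn = sqlite3.connect("audit.db")
conn.execute("""
CREATE TABLE IF NOT EXISTS patient_record (
    patient_id INTEGER PRIMARY KEY,
    ethnicity  TEXT,     -- the handful of columns you filter or sort on
    gender     TEXT,
    age        INTEGER,
    details    TEXT      -- everything else, as one JSON document
)""")

details = {"smoker": True, "symptoms": ["back pain"], "examination": {"rom_flexion_deg": 40}}
conn.execute(
    "INSERT OR REPLACE INTO patient_record VALUES (?, ?, ?, ?, ?)",
    (1, "NZ European", "F", 57, json.dumps(details)),
)
conn.commit()

# Filter on the real columns; pretty-print the JSON in application code.
for pid, blob in conn.execute(
    "SELECT patient_id, details FROM patient_record WHERE gender = ?", ("F",)
):
    print(pid, json.loads(blob)["symptoms"])
```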
I have to structure a MySQL database for work and haven't done that in years. I'd love to get some ideas from you. So here's the task:
I have a couple of "shops" that have, depending on the day of the week and year, different opening hours, which could change further down the line. The shops have space for a given number of people (which could change later as well).
A few times a day we count the number of people in the shop.
We want to compare the utilized capacity between shops. I myself would like to use dc.js to be able to get as many stats as possible from the data.
We also have two different methods of counting our users:
By hand. Reliable, but time consuming.
Light barrier. Automatic, but very inaccurate.
I'd like to get a better approximation of the user count using the light-barrier data and some machine learning algorithm.
Anyway, do you have any tips on how to design the DB as efficiently as possible for my tasks? I was thinking:
SHOP
  Id
  Name

OPENINGHOURS
  Id
  ShopId
  MaxUsers
  Date
  Open
  Close

MANUALUSERCOUNT
  Id
  ShopId
  Time
  Count

AUTOUSERCOUNT
  Id
  ShopId
  Time
  Count
Does this structure make sense (at all and for my tasks)?
Thank you!
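For what it's worth, here is the proposed structure sketched as DDL (SQLite syntax for illustration, which carries over to MySQL; Time is assumed to be the full timestamp of each count), plus the kind of utilization query the shop comparison would need:

```python
import sqlite3

conn = sqlite3.connect("shops.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS Shop (
    Id   INTEGER PRIMARY KEY,
    Name TEXT NOT NULL
);

CREATE TABLE IF NOT EXISTS OpeningHours (
    Id       INTEGER PRIMARY KEY,
    ShopId   INTEGER REFERENCES Shop(Id),
    MaxUsers INTEGER,
    Date     TEXT,   -- the day these hours and this capacity apply to
    Open     TEXT,
    Close    TEXT
);

CREATE TABLE IF NOT EXISTS ManualUserCount (
    Id     INTEGER PRIMARY KEY,
    ShopId INTEGER REFERENCES Shop(Id),
    Time   TEXT,     -- timestamp of the count
    Count  INTEGER
);
-- AutoUserCount would be identical to ManualUserCount.
""")

# Utilized capacity = counted people / capacity in force on that day.
utilization_sql = """
SELECT s.Name, m.Time, 1.0 * m.Count / o.MaxUsers AS Utilization
FROM ManualUserCount AS m
JOIN Shop AS s         ON s.Id = m.ShopId
JOIN OpeningHours AS o ON o.ShopId = m.ShopId AND o.Date = date(m.Time)
ORDER BY s.Name, m.Time;
"""
for row in conn.execute(utilization_sql):
    print(row)
```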
For an application of this size, I see no problem with this at all. Except: what does the "Time" column in the user-count tables refer to?
What I have is about 130 GB of time-varying state data for several thousand financial instruments' orderbooks.
The CSV files I have contain a row for each change in the orderbook state (due to an executed trade, an inserted order, etc.). The state is described as: a few fields of general orderbook information (e.g. the ISIN code of the instrument), a few fields of information about the state change (such as orderType and time), and finally the buy and sell levels of the current state. There are up to 20 levels of both buy and sell orders (buy level 1 representing the best buy price, sell level 1 the best sell price, and so on), and each of them consists of 3 fields (price, aggregated volume and order amount). Finally, there are 3 additional fields of aggregated data for the levels beyond 20, on both the buy and sell sides. This amounts to a maximum of 21 * 2 * 3 = 126 fields of level data per state.
The problem is that there are rarely anywhere near 20 levels, so it doesn't seem to make sense to reserve fields for all of them. E.g. I'd have rows with only 3 buy levels and the rest of the fields empty. On the other hand, the same orderbook can have 7 buy levels a few moments later.
I will definitely normalize the general orderbook info into its own table, but I don't know how to handle the levels efficiently.
Any help would be much appreciated.
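For reference, the fully normalized alternative (one row per populated level) could be laid out roughly like this; the table and column names are invented:

```python
import sqlite3

conn = sqlite3.connect("orderbook.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS book_state (
    state_id   INTEGER PRIMARY KEY,
    isin       TEXT,
    event_time TEXT,
    order_type TEXT
);

CREATE TABLE IF NOT EXISTS book_level (
    state_id     INTEGER REFERENCES book_state(state_id),
    side         TEXT CHECK (side IN ('B', 'S')),
    level        INTEGER,   -- 1..20, or 21 for the beyond-20 aggregate
    price        REAL,
    volume       REAL,      -- aggregated volume at this level
    order_amount INTEGER,
    PRIMARY KEY (state_id, side, level)
);
""")
```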
I have had to deal with exactly this structure of data at one point in time. One important question is how the data will be used. If you are only looking for the best bid and ask prices at any given time, then the levels do not make much of a difference. If you are analyzing market depth, then the levels can be important.
For the volume of data you are using, other considerations such as indexing and partitioning may be more important. If the data you need for a particular query fits into memory, then it doesn't matter how large the overall table is.
My advice is to keep the different levels in the same record. Then, you can use page compression (depending on your storage engine) to eliminate most of the space reserved for the empty values. SQL Server does this automatically, so it was a no-brainer to put the levels in a single record.
A compromise solution, if page compression does not work, is to store a fixed number of levels. Five levels would typically be populated, so you wouldn't have the problem of wasted space on empty fields. And, that number of levels may be sufficient for almost all usage.
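A sketch of that compromise: a single wide record with a fixed five levels per side, the 30 level columns generated programmatically rather than typed out by hand (all names are invented):

```python
import sqlite3

N_LEVELS = 5
level_cols = []
for side in ("buy", "sell"):
    for lvl in range(1, N_LEVELS + 1):
        level_cols += [
            f"{side}{lvl}_price REAL",
            f"{side}{lvl}_volume REAL",
            f"{side}{lvl}_order_amount INTEGER",
        ]

columns = ",\n    ".join(level_cols)
ddl = f"""
CREATE TABLE IF NOT EXISTS book_state_wide (
    state_id   INTEGER PRIMARY KEY,
    isin       TEXT,
    event_time TEXT,
    order_type TEXT,
    {columns}
);
"""

conn = sqlite3.connect("orderbook_wide.db")
conn.execute(ddl)
```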