LSTM for predicting multiple sequences at the same time

I'm currently working on a project involving a dataset that contains smartphone usage data from roughly 200 users over a period of 4 months. For each user, I have a dataframe of app-log events (name of the app, time, location, etc.). My goal is to predict the dwell time for the next app a user is going to open. I don't want to build one model per user; instead, I'm trying to build a single model for all users combined. Now I'm struggling to find an architecture that is suitable for this project.
The records are not evenly spaced in time, and the length of each dataframe differs. I want to exploit the temporal dependencies while learning from multiple users at once, so my input would be multiple parallel sequences of app usage durations with additional features, and my output again multiple parallel sequences containing the dwell time for the next app. But since the sequences are neither evenly spaced in time nor of the same length, that setup does not seem suitable as-is. I just wanted to get some ideas on how to structure the data properly and what you think would be a suitable approach. I would really appreciate some ideas or reading recommendations.
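One way to make that concrete (a minimal sketch, assuming TensorFlow/Keras -- the framework, feature set, and shapes below are my assumptions, not from the question): treat each user's log as one variable-length sequence, pad all sequences to a common length and mask the padding, and feed the time gap since the previous event as an extra feature so the irregular spacing becomes something the model can learn from rather than an obstacle.

```python
# Sketch only: pad variable-length per-user sequences and mask the padding.
import tensorflow as tf
from tensorflow.keras import layers

MAX_LEN = 200      # pad/truncate every user's sequence to this length
N_FEATURES = 4     # e.g. app id, dwell time, time delta to previous event, hour of day

# X: (n_users, MAX_LEN, N_FEATURES), zero-padded
# y: (n_users, MAX_LEN, 1) -- dwell time of the *next* app at each step
inputs = layers.Input(shape=(MAX_LEN, N_FEATURES))
x = layers.Masking(mask_value=0.0)(inputs)            # skip padded timesteps
x = layers.LSTM(64, return_sequences=True)(x)         # one hidden state per step
outputs = layers.TimeDistributed(layers.Dense(1))(x)  # one prediction per step

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mae")
# model.fit(X, y, batch_size=32, epochs=10)
```

Batching users this way lets a single model train on all 200 users at once; per-user effects could be captured by adding a user-id embedding as a further input.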

Related

Workflow for getting data collected in the field into relational database?

Context: I'm a field ecology grad student. In our lab we often collect data in the field in a notebook and, when we return from the field, we need to get it into digital form. The standard in our lab is to enter everything into a flat Excel table, but this is error-prone and the data are hard to work with afterwards. There is some interest in our lab in migrating to MySQL to store data, and the main issue is getting the data from a flat format in a notebook into a corresponding set of MySQL tables. I've been tasked with teaching the lab how to do this because I already have some SQL experience.
Simplified Example: We live trap for small mammals of various species at various sites and take measurements. We typically have on the order of 100 measurements at a time. We have 3 SQL tables for the data: one corresponding to site information (e.g. habitat, lat/long, etc.), one corresponding to species information (taxonomy, dental formula, etc.) and one storing the measurements of individual animals (weight, length, etc.).
The solution needs to be fairly easy to implement for people who aren't super technical. For example, creating a web form is probably too complicated, though if there's an easy way to do this I'd be open to it. Also, it should be possible to enter large amounts of data at once. Finally, it should be pretty easy to set up for new datasets because we're often designing new studies and needing to collect different types of data.
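For what it's worth, here is a minimal sketch of one low-tech workflow that fits these constraints: keep transcribing into a flat sheet (familiar, fast for bulk entry), export it to CSV, and let a short script split it into the normalized tables. All file, table, and column names below are invented for illustration, and sqlite3 stands in for MySQL (a MySQL connector or SQLAlchemy engine would slot in the same way).

```python
# Sketch: split a flat field-notebook export into three normalized tables.
import sqlite3
import pandas as pd

flat = pd.read_csv("field_notebook.csv")  # hypothetical flat export from Excel

# One row per captured animal, with site and species info repeated on each row;
# drop_duplicates() recovers the site and species tables from that repetition.
sites = flat[["site_name", "habitat", "lat", "long"]].drop_duplicates()
species = flat[["species_code", "genus", "species"]].drop_duplicates()
captures = flat[["site_name", "species_code", "weight_g", "length_mm"]]

con = sqlite3.connect("lab.db")
sites.to_sql("site", con, if_exists="append", index=False)
species.to_sql("species", con, if_exists="append", index=False)
captures.to_sql("capture", con, if_exists="append", index=False)
con.close()
```

A new study then only needs a new column list per table, which keeps the setup cost low for non-technical lab members.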

Information Architecture & Retrieval - Prioritizing queries

I've got an application that displays data to the user based on a relevancy score. There are 5 to 7 different types of information that I can display (e.g. Users Tags, Friends Tags, Recommended Tags, Popular Tags, etc.). Each information type would be a separate SQL query.
Then I have an algorithm that ranks each type by how relevant it is. The algorithm is based on several factors, including how long it's been since an action was taken on a particular type, how important one information type is to another, how often one type has been shown, etc.
Once they are ranked, I show them to the users in a feed, similar to Facebook.
My question is a simple one: I need the data before I can run it through the ranking algorithm, so what's the most efficient way to pull only the data I need from the database?
Currently I pull the top 5 instances of each information type, and then rank those. Each piece of data gets a relevancy score, and if I don't have enough results that reach a certain relevancy threshold, I go back to the database for the next 5 of each.
The problem with this approach is that I risk pulling too many of one story type that I never use, and I have to keep going back to the database if I don't get what I need the first time.
I have thought about a massive SQL query that incorporates all info types and the algorithm, which could work, but that would really be a huge query, and then I'm having MySQL do a lot of processing. I'm of the general mindset that MySQL should do the data retrieval and my programming language (PHP) should do the processing.
There has to be a better way! I'm sure there is a scholarly article somewhere, but I haven't been able to find it.
Thanks Stack Overflow
I'm assuming that by information type you mean Users Tags, Friends Tags, etc. Rather than fetching your data again and again against a particular fixed threshold, I would recommend changing your algorithm a little: try assigning a weight to each information type. Then even if you only get a few records of a low-priority type, you don't have to fetch more of them.
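A rough sketch of that weighting idea (the weights, scoring formula, and function names here are my assumptions, not from the post): fetch one batch per information type, score each item as type weight times relevancy, and assemble the feed in a single pass instead of looping back to the database.

```python
# Sketch: one fetch per type, then rank everything in application code.
TYPE_WEIGHTS = {
    "user_tags": 1.0,         # illustrative weights -- tune per application
    "friend_tags": 0.8,
    "recommended_tags": 0.6,
    "popular_tags": 0.4,
}

def build_feed(fetch_batch, feed_size=20, batch_size=5):
    """fetch_batch(info_type, n) -> list of (item, relevancy) pairs,
    e.g. backed by one SQL query per information type."""
    scored = []
    for info_type, weight in TYPE_WEIGHTS.items():
        for item, relevancy in fetch_batch(info_type, batch_size):
            scored.append((weight * relevancy, item))
    scored.sort(key=lambda pair: pair[0], reverse=True)  # best first
    return [item for _, item in scored[:feed_size]]
```

Because low-priority types are down-weighted rather than filtered out by a threshold, a single round trip per type is usually enough to fill the feed.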

Database solutions for a prospective (not retrospective) search

Let's say we have a requirement to create a system that consumes a high-volume, real-time data stream of documents, and that matches those documents against a set of user-defined search queries as those documents become available. This is a prospective, as opposed to a retrospective, search service. What would be an appropriate persistence solution?
Suppose that users want to see a live feed of documents that match their queries--think Google Alerts--and that the feed must display certain metadata for each document. Let's assume an indefinite lifespan for matches; i.e., the system will allow the user to see all of the matches for a query from the time when the particular query was created. So the metadata for each document that comes in the stream, and the associations between the document and the user queries that matched that document, must be persisted to a database.
Let's throw in another requirement, that users want to be able to facet on some of the metadata: e.g., the user wants to see only the matching documents for a particular query whose metadata field "result type" equals "blog," and wants a count of the number of blog matches.
Here are some hypothetical numbers:
- 200,000 new documents in the data stream every day.
  - The metadata for every document is persisted.
- 1,000 users with about 5 search queries each: about 5,000 total user search queries.
  - These queries are simple boolean queries.
  - As each new document comes in, it is processed against all 5,000 queries to see which queries match.
- Each feed--one for each user query--is refreshed to the user every minute. In other words, for every feed, a query to the database for the most recent page of matches is performed every minute.
Speed in displaying the feed to the user is of paramount importance. Scalability and high availability are essential as well.
The relationship between users and queries is relational, as is the relationship between queries and matching documents, but the document metadata itself are just key-value pairs. So my initial thought was to keep the relational data in a relational DB like MySQL and the metadata in a NoSQL DB, but can the faceting requirement be achieved in a NoSQL DB? Also, constructing a feed would then require making a call to two separate data stores, which is additional complexity. Or perhaps shove everything into MySQL, but this would entail lots of joins and counts. If we store all the data as key-value pairs in some other kind of data store, again, how would we do the faceting? And there would be a ton of redundant metadata for documents that match more than one search query.
What kind of database(s) would be a good fit for this scenario? I'm aware of tools such as Twitter Storm and Yahoo's S4, which could be used to construct the overall architecture of such a system, but I'd like to focus on the database, given the data storage, volume, and query/faceting requirements.
First, I disagree with Ben. 200k new records per day against 86,400 seconds in a day works out to a bit over two records per second. That is not earth-shattering, but it is a respectable clip for new data.
Second, I think this is a real problem that people face. I'm not going to be one that says that this forum is not appropriate for the topic.
I think the answer to the question has a lot to do with the complexity and type of user queries that are supported. If the queries consist of a bunch of binary predicates, for instance, then you can extract the particular rules from the document data and then readily apply the rules. If, on the other hand, the queries consist of complex scoring over the text of the documents, then you might need an inverted index paired with a scoring algorithm for each user query.
My approach to such a system would be to parse the queries into the individual data elements that can be determined from each document (I might call this a "query signature", since the result would contain all the fields needed to satisfy the queries). This signature would be computed each time a document is loaded, and it could then be used to satisfy the queries.
Adding a new query would require processing all existing documents to assign the new values. Given the volume of data, this might need to be more of a batch task.
Whether SQL is appropriate depends on the features that you need to extract from the data. This in turn depends on the nature of the user queries. It is possible that SQL is sufficient. On the other hand, you might need more sophisticated tools, especially if you are using text mining concepts for the queries.
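To make the "query signature" idea concrete, here is a hedged sketch, assuming the user queries are conjunctions of simple boolean predicates (the predicates and stored queries below are invented for illustration): each distinct predicate is evaluated once per incoming document, and the resulting bits are reused across all 5,000 stored queries.

```python
# Sketch: evaluate shared predicates once per document, reuse across queries.
PREDICATES = {
    "is_blog":       lambda doc: doc.get("result_type") == "blog",
    "mentions_acme": lambda doc: "acme" in doc.get("text", "").lower(),
    "is_english":    lambda doc: doc.get("lang") == "en",
}

# Each stored user query is a list of predicate names ANDed together.
USER_QUERIES = {
    "query_1": ["is_blog", "mentions_acme"],
    "query_2": ["is_english"],
}

def matching_queries(doc):
    # The "query signature": one boolean per predicate, computed once.
    signature = {name: pred(doc) for name, pred in PREDICATES.items()}
    return [qid for qid, names in USER_QUERIES.items()
            if all(signature[n] for n in names)]

doc = {"result_type": "blog", "text": "ACME ships new codec", "lang": "en"}
print(matching_queries(doc))  # -> ['query_1', 'query_2']
```

The resulting matches (document id, query id, metadata) would then be persisted so that the per-minute feed reads stay simple indexed lookups.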
Thinking about this, it sounds like an event-processing task rather than a regular data-processing operation, so it might be worth investigating Complex Event Processing systems: rather than building everything on a regular database, use a system which processes the queries on the incoming data as it streams into the system. There are commercial systems that can hit the speed and high-availability criteria, but I haven't researched the available OSS options (luckily, people on Quora have done so).
Take a look at Elasticsearch. It has a percolator feature that matches a document against registered queries.
http://www.elasticsearch.org/blog/2011/02/08/percolator.html
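For illustration, a sketch of the percolator flow using the modern percolate query (the API has changed since that 2011 post; the index name, fields, and documents below are made up):

```python
# Sketch: register user queries, then ask which of them match each new document.
import requests

ES = "http://localhost:9200"

# 1. An index whose mapping stores queries in a "percolator" field.
requests.put(f"{ES}/alerts", json={
    "mappings": {"properties": {
        "query": {"type": "percolator"},
        "text": {"type": "text"},
        "result_type": {"type": "keyword"},
    }}
})

# 2. Register one user query as a document.
requests.put(f"{ES}/alerts/_doc/user-query-1", json={
    "query": {"bool": {"must": [
        {"match": {"text": "earthquake"}},
        {"term": {"result_type": "blog"}},
    ]}}
})

# 3. For each incoming document, find the stored queries it matches.
resp = requests.post(f"{ES}/alerts/_search", json={
    "query": {"percolate": {
        "field": "query",
        "document": {"text": "Earthquake reported offshore", "result_type": "blog"},
    }}
})
print([hit["_id"] for hit in resp.json()["hits"]["hits"]])  # -> ['user-query-1']
```

The faceting requirement (e.g. counting matches where "result type" equals "blog") would then map onto Elasticsearch aggregations over the persisted matches.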

Medium-term temporary tables - creating tables on the fly to last 15-30 days?

Context
I'm currently developing a tool for managing orders and communicating between technicians and services. The industrial context is broadcast and TV. Multiple clients expecting media files each made to their own specs imply widely varying workflows even within the restricted scope of a single client's orders.
One client can ask one day for a single SD file and the next for a full-blown HD package containing up to fourteen files... In a MySQL database I am trying to store accurate information about all the small tasks composing the workflow, in several forms:
DATETIME values every time a task is accomplished, for accurate tracking
paths to the newly created files in the company's file system in VARCHARs
archiving background info in TEXT values (info such as user comments; e.g., when an incident happens and prevents moving forward, users can comment about it in this feed)
Multiply that by 30 different file types and this is way too much for a single table. So I thought I'd break it up by client: one table per client, so that any order only ever requires the use of that one table, which wouldn't manipulate more than 15 fields. Still, this is a pretty rigid solution when a client has 9 different transcoding specs and a particular order only requires one. I figure I'd need to add flag fields for each transcoding field to indicate which ones are required for that particular order.
Concept
I then had this crazy idea: maybe I could create a temporary table that lasts while the order is running (which can range from about 1 day to 1 month). We rarely have more than 25 orders running simultaneously, so it wouldn't get too crowded.
The idea is to make a table tailored to each order, eliminating the need for flags and forever-empty fields. Once the order is complete, the table would get flushed, JSON-encoded, into a TEXT or BLOB so it can be restored later if changes need to be made.
Do you have experience with DBMSs (MySQL in particular) struggling under such practices, if this has ever been tried? Does this sound like a viable option? I am happy to keep trying (which I have already started), and I am seeking advice on whether to keep going or stop right here.
Thanks for your input!
Well, of course that is possible to do. However, you cannot use MySQL temporary tables for such long-term storage, since a temporary table only lives for the duration of the session that created it; you would have to use "normal" tables and some clean-up routine.
However, I do not see why that amount of data would be too much for a single table. If your queries start to run slowly due to the amount of data, you should add some indexes. I also see another con: it will be much harder to build reports later on, because with 25 tables holding the same kind of data you will have to run 25 queries and merge the results.
I do not see the point, really. The same kinds of data should be in the same table.
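As a concrete alternative, here is a hedged sketch of the single shared table this answer suggests: one narrow row per task, so an order only gets rows for the tasks it actually needs -- no flag columns and no forever-empty fields. The schema is illustrative, and sqlite3 stands in for MySQL.

```python
# Sketch: one long/narrow task table shared by all clients and orders.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE task (
    task_id   INTEGER PRIMARY KEY,
    order_id  INTEGER NOT NULL,
    client_id INTEGER NOT NULL,
    task_type TEXT    NOT NULL,  -- e.g. 'transcode_hd', 'qc_check', 'delivery'
    done_at   TEXT,              -- DATETIME when the task was accomplished
    file_path TEXT,              -- path to the produced file, if any
    notes     TEXT               -- free-form comments / incident feed
);
CREATE INDEX idx_task_order ON task (order_id);
""")

# An order inserts only the rows for the specs it actually uses.
con.execute(
    "INSERT INTO task (order_id, client_id, task_type, done_at) VALUES (?, ?, ?, ?)",
    (42, 7, "transcode_hd", "2013-05-01 10:00:00"),
)
con.commit()
```

Reports across all clients then remain a single query against one indexed table, and completed orders can be archived by flagging or moving rows rather than serializing whole tables to JSON.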

How to store logging data?

I have built an app which does random stuff, and I want to collect various statistics to display in graphs. My problem is that I'm not sure how to store the data in a database, other than writing each log entry into a new row, which seems very inefficient.
Example Data (5 minute averages):
Number of Users
Number of Online Users
Number of Actions
What would be the best way to store this information? Do I need a separate table for each thing that I'm logging, or could they all go into one table?
Often you don't need the full-resolution data kept for all time, and you can re-process it periodically into lower-resolution data to save space. For example, you could store one day of full resolution (5-minute averages) but periodically re-average that data into 1-hour bins, 1-day bins, 1-month bins, etc., while also culling the raw rows.
This allows you to have the data you need to display nice graphs of the activity over different time ranges (hour, day, week, month, etc) while limiting the number of rows to just what your application requires.
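A minimal sketch of that roll-up, assuming a single long table of 5-minute samples (the table and column names are invented, and sqlite3 stands in for whatever RDBMS the app uses):

```python
# Sketch: roll 5-minute averages older than a day up into hourly averages,
# then delete the detail rows that were just aggregated.
import sqlite3

con = sqlite3.connect("stats.db")
con.executescript("""
CREATE TABLE IF NOT EXISTS stats_5min   (ts   TEXT, metric TEXT, value REAL);
CREATE TABLE IF NOT EXISTS stats_hourly (hour TEXT, metric TEXT, value REAL);
""")

con.execute("""
    INSERT INTO stats_hourly (hour, metric, value)
    SELECT strftime('%Y-%m-%d %H:00', ts), metric, AVG(value)
    FROM stats_5min
    WHERE ts < datetime('now', '-1 day')
    GROUP BY strftime('%Y-%m-%d %H:00', ts), metric
""")
con.execute("DELETE FROM stats_5min WHERE ts < datetime('now', '-1 day')")
con.commit()
```

The same pattern repeats for daily and monthly bins, and a single (ts, metric, value) table means adding a new statistic needs no schema change.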
There are also excellent applications to store and display time-series data. MRTG and RRDTool come to mind. See this tutorial for good food for thought:
rrdtutorial