Socrata - SoQL - Query for recent data - json

I'm trying to pull data from a variety of Socrata datasets into an analytics architecture using the REST JSON API. I would like to find a way to get only the new data that has been added to a dataset since my last request.
My plan at the moment is to use $order and $where with one of the date fields, and then pull a filtered set covering the last day every 24 hours.
Are there any examples of ways to do this kind of date math, or is there a better way that I'm missing to get the newest data since the last query?
Your help is appreciated.
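For concreteness, here is a minimal sketch of that 24-hour pull in Python with the requests library. The endpoint URL and the updated_on field name are placeholders for your own dataset; the key idea is to persist the greatest date value seen so far and use it as the lower bound of the next $where filter.

```python
import requests

# Hypothetical dataset endpoint and date field -- substitute your own.
ENDPOINT = "https://data.example.gov/resource/abcd-1234.json"
DATE_FIELD = "updated_on"

def fetch_since(last_seen):
    """Pull rows whose date field is strictly greater than last_seen."""
    params = {
        "$where": f"{DATE_FIELD} > '{last_seen}'",
        "$order": f"{DATE_FIELD} ASC",  # oldest first, so the last row is the newest
        "$limit": 1000,
    }
    resp = requests.get(ENDPOINT, params=params)
    resp.raise_for_status()
    return resp.json()

# First run: seed with a floating timestamp; later runs reuse the stored value.
last_seen = "2014-07-01T00:00:00"
rows = fetch_since(last_seen)
if rows:
    # ISO timestamps compare correctly as strings; persist this value
    # (file, database, etc.) for the next 24-hour cycle.
    last_seen = max(row[DATE_FIELD] for row in rows)
```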

Related

How to Filter Data in a Single MySQL Database Field that has Multiple Entries

On our Wordpress site, we use a plugin called s2member and it stores the levels (roles) of our clients as well as the times they were assigned a specific level in our database. I would like to create a table that shows when a user was assigned a specific level. I'm having a challenge getting the data I need because of the way the data is stored in the field. It stores all of the levels along with the associated dates and times when a user's level was changed in one field. In addition, it stores all of the times as Unix timestamps. Here's an example of a typical field associated with a client:
a:20:{s:15:"1562695223.0001";s:6:"level0";s:15:"1562695223.0002";s:6:"level1";s:15:"1562695223.0003";s:6:"level2";s:15:"1562695223.0004";s:6:"level3";s:15:"1577906312.0001";s:11:"ccap_prepay";s:15:"1596575898.0001";s:12:"-ccap_prepay";s:15:"1596575898.0002";s:13:"ccap_graduate";s:15:"1596575898.0003";s:11:"ccap_prepay";s:15:"1596575898.0004";s:7:"-level3";s:15:"1597196952.0001";s:14:"-ccap_graduate";s:15:"1597196952.0002";s:12:"-ccap_prepay";s:15:"1597196952.0003";s:13:"ccap_graduate";s:15:"1597196952.0004";s:11:"ccap_prepay";s:15:"1598382433.0001";s:14:"-ccap_graduate";s:15:"1598382433.0002";s:12:"-ccap_prepay";s:15:"1598382433.0003";s:11:"ccap_prepay";s:15:"1598382433.0004";s:6:"level3";s:15:"1605290551.0001";s:12:"-ccap_prepay";s:15:"1605290551.0002";s:11:"ccap_prepay";s:15:"1605290551.0003";s:13:"ccap_graduate";}
There are four columns in this table: umeta_id; user_id; meta_key; meta_value. The data above is stored in the meta_value column.
You'll notice that it also has multiple ccap_* entries. CCAP stands for custom capability, and I would like to be able to chart those assignments and their associated times as well.
Do you have any idea how I can accomplish this?
Thank you for any help you can give.
I talked to an engineer about this, and he told me I would need to learn Python, and, he believed, Pandas and NumPy as well, to extract the data I need, but he wasn't exactly sure. I started taking a data analyst course on Coursera, but I still haven't learned what I need to learn and it's already been several months. It would be great if someone could provide a solution that I could implement more quickly and use on an ongoing basis.
If there's a way to accomplish my goal by exporting this table to a CSV file and using Microsoft Excel or Google Sheets, I'm open to that too.
Here's an image of the table (if it helps):
Database table
Here's an example of my desired output:
Desired output
In my desired output, I used Excel to create one column that converts the Unix timestamp to a short date, and another column with a nested IF statement that translates each CCAP or level into the meaning we use internally.
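That field is a standard PHP-serialized array (WordPress stores many meta values this way), so you don't necessarily need Pandas or NumPy: the timestamp keys and level/CCAP values simply alternate. Here is a minimal Python sketch that extracts the pairs with a regular expression; the meta_value below is a truncated copy of the example above, and mapping each role to your internal label is left as a placeholder.

```python
import re
from datetime import datetime, timezone

# Truncated copy of the serialized meta_value from the question.
meta_value = (
    'a:20:{s:15:"1562695223.0001";s:6:"level0";'
    's:15:"1562695223.0002";s:6:"level1";'
    's:15:"1577906312.0001";s:11:"ccap_prepay";}'
)

# Every serialized string looks like s:<length>:"<text>";
# keys (timestamps) and values (levels/CCAPs) alternate.
strings = re.findall(r's:\d+:"([^"]*)";', meta_value)

for ts_key, role in zip(strings[0::2], strings[1::2]):
    # The integer part of the key is a Unix timestamp; the .000N
    # suffix only keeps simultaneous changes unique.
    ts = int(ts_key.split('.')[0])
    date = datetime.fromtimestamp(ts, tz=timezone.utc).strftime('%Y-%m-%d')
    print(date, role)  # map `role` to your internal label here
```

If you export the table to CSV, you can run the same loop over the meta_value column for every user and write the date/role rows straight into a spreadsheet.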

Is it possible to use hbase to store json documents and query with phoenix?

I am trying to figure out a solution for our data problem. Basically, we:
- have event data on a per-user basis that streams into our system
- want to be able to aggregate several users together when it is clear that they are the same person (so we propose to store the event data in HBase, where we can delete and update rows)
- have the data in the form of JSON documents
- would like to be able to run SQL-like queries on the data that, for example, retrieve all the rows whose JSON document has a key of 'page-visited' and a value of 'homepage'. In other words, we want to be able to build queries that look at the individual keys and values of the JSON documents.
I am trying to figure out if it is possible to:
- store this data in HBase (I think it should be possible/easy)
- query it with Phoenix in some way (I've only just started looking at Phoenix, but it seems like it might be possible to define a column as a 'json' type, and maybe there are JSON functions, though I haven't found any yet)
Thanks for your help.
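I can't confirm a first-class JSON column type in Phoenix either, so one pattern that does work is to store the raw document in a VARCHAR column and promote the handful of keys you filter on into real columns at write time. Below is a rough sketch, assuming the phoenixdb Python adapter and a Phoenix Query Server on localhost:8765; the table, columns, and event are all made up for illustration.

```python
import json
import phoenixdb  # Python adapter for the Phoenix Query Server

# Assumed Phoenix Query Server address -- adjust for your cluster.
conn = phoenixdb.connect('http://localhost:8765/', autocommit=True)
cursor = conn.cursor()

# Keep the raw JSON in a VARCHAR column, and promote the keys you
# query most often into first-class columns at write time.
cursor.execute("""
    CREATE TABLE IF NOT EXISTS user_events (
        user_id      VARCHAR NOT NULL,
        event_id     VARCHAR NOT NULL,
        page_visited VARCHAR,
        doc          VARCHAR,
        CONSTRAINT pk PRIMARY KEY (user_id, event_id)
    )
""")

event = {"page-visited": "homepage", "referrer": "email"}
cursor.execute(
    "UPSERT INTO user_events VALUES (?, ?, ?, ?)",
    ("user-42", "evt-001", event.get("page-visited"), json.dumps(event)),
)

# The SQL-like query from the question, against the promoted column:
cursor.execute(
    "SELECT user_id, doc FROM user_events WHERE page_visited = ?",
    ("homepage",),
)
print(cursor.fetchall())
```

The trade-off is that you must decide up front which keys deserve their own columns; anything else can still be retrieved from the raw doc, just not filtered on efficiently.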

Does the Socrata SODA API support getting a list of dates on which the dataset was modified?

Does the Socrata SODA API support a method to query out all the dates a dataset has been updated? Basically, a changelog for the dataset, with an object for every modification/update to it.
There is an existing question that asks for the last modified date (you can get it through the /data.json API available on all Socrata-powered sites).
There is also a method to get the modified dates of individual rows using system fields and the :updated_at field. But this is incomplete: a data provider might update every row each time, so there is no guarantee that we are really getting back a history of modifications, just the most recent modification to each row.
I'm looking for, at a minimum, the complete list of modification dates. We are trying to get a sense of activity on datasets, and we need to know how often they are being updated.
Unfortunately, Max, we don't offer what you're looking for. We've got the last time the dataset and metadata were modified, but not a changelog of every single time that there was a change.
A surprisingly large number of datasets change very frequently, as often as every 5 minutes.
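For anyone needing an approximation in the meantime, two coarse signals are available: the catalog-level modified date from /data.json and the newest row-level :updated_at. Here is a rough sketch in Python; the domain and dataset ID are examples (the Seattle dataset from the next question), and the $select on :updated_at assumes the endpoint exposes system fields.

```python
import requests

DOMAIN = "data.seattle.gov"  # example Socrata-powered site
DATASET_ID = "kzjm-xkqj"     # example dataset ID

# 1. Dataset-level last-modified from the catalog's /data.json feed.
catalog = requests.get(f"https://{DOMAIN}/data.json").json()
for entry in catalog.get("dataset", []):
    if DATASET_ID in entry.get("identifier", ""):
        print("dataset modified:", entry.get("modified"))

# 2. Newest row-level modification via the :updated_at system field
#    (assumes the endpoint lets you $select system fields).
resp = requests.get(
    f"https://{DOMAIN}/resource/{DATASET_ID}.json",
    params={"$select": ":updated_at", "$order": ":updated_at DESC", "$limit": 1},
)
print("newest row update:", resp.json())
```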

Using Socrata SODA API to query most recent rows by datetime

I am new to this site and this is my first question. I am trying to query the "Seattle Real Time Fire 911 Calls" database from the Socrata Seattle Open Data site: https://data.seattle.gov/Public-Safety/Seattle-Real-Time-Fire-911-Calls/kzjm-xkqj. I'm not an expert at using the SODA API, and I'm having difficulty figuring out how to query the most recent entries in the database. All attempts to use the "order" or "where" SoQL statements give me data from 2010 or 2011, and I cannot figure out how to query the most recent 300 entries. Querying the "top" rows yields the oldest entries. Using a full OData feed pull yields data as recent as today, but I need to use a fast json or csv SODA API query.
Note: The datetime field does not respond to any "where" statements that I use.
Thank you!
OK, a few tips to get started:
- The $order parameter sorts in ascending (ASC) order by default, so you'll want to order by datetime DESC to get the latest records first.
- Unfortunately, Seattle has a number of records listed with no datetime, so you'll also want to filter with a $where query to only retrieve results in a date range. $where=datetime > '2014-07-01' works for me, for example.
- To get only the top 300 results, pass a $limit=300 parameter as well.
Here's a sample request in Runscope for you to try out.
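Putting all three together, here is a minimal sketch of the request in Python with the requests library, using the resource endpoint derived from the dataset ID in the question.

```python
import requests

# Resource endpoint for "Seattle Real Time Fire 911 Calls" (dataset kzjm-xkqj).
url = "https://data.seattle.gov/resource/kzjm-xkqj.json"
params = {
    "$where": "datetime > '2014-07-01'",  # skip rows with no/old datetime
    "$order": "datetime DESC",            # newest first
    "$limit": 300,                        # only the most recent 300
}
rows = requests.get(url, params=params).json()
if rows:
    print(len(rows), "rows; newest:", rows[0]["datetime"])
```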

Google Refine and fetching data from freebase for a large data set to create a column from URL not working

I have a Google Refine project with 36k rows of data. I would like to add another column by fetching JSON data from a Freebase URL. I was able to get it working on a small dataset, but when I ran it on this project it took a few hours to process and then most of the results were blank, though I did get some results with data. Is there a way to limit the number of rows the data will be fetched for, or a better way of getting the data from the URL?
Thank You!
If you're adding data from Freebase, you'd probably be better off using the "Add column from Freebase" rather than "Add column by fetching URL."
Facets are one of the most powerful Google Refine features and they can be used to control all kinds of things. In this case, you could use a facet to select a subset of your data and only do the fetch on that subset (and then repeat with a different subset).
The next version of Refine will include better error reporting on the results of URL fetches to help debug problems like this. In the meantime, make sure that you're respecting all the limits of the remote site, such as the total number of requests, requests per second, etc.