Topic proportions in my corpus? - LDA

Thanks for reading and taking the time to think about and respond to this.
I am using Gensim's wrapper for Mallet (ldamallet.py), and it works like a charm. I need to get the topic proportions for my corpus (over all my documents) and I do not know how to do that. model.alpha is not it, as it is not normalized to 1; besides, alpha contains my Dirichlet prior parameters, not the topic proportions. Am I correct?
Any help is much appreciated.

When you call Mallet from the command line, the switch --output-doc-topics gives you the topic composition (as a CSV file).
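If you are going through the Gensim wrapper instead, a minimal sketch (the Mallet path and topic count are placeholders; averaging the per-document distributions is one straightforward way to get corpus-wide proportions):

import numpy as np
from gensim.models.wrappers import LdaMallet

# corpus and dictionary are assumed to exist already, as in the question
model = LdaMallet('/path/to/mallet', corpus=corpus, num_topics=20, id2word=dictionary)

# per-document topic distributions (the wrapper reads Mallet's doc-topics output)
doc_topics = np.array([[weight for _, weight in doc] for doc in model[corpus]])

# corpus-wide topic proportions: the mean over all documents, which sums to 1
corpus_proportions = doc_topics.mean(axis=0)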

How to edit a big shapefile online?

I have a big shapefile (about 1 GB) and I need to display and update data from this file with an online map (Mapbox).
As I understand it, I need to convert this shapefile into some database (SpatiaLite? PostGIS) and then query it by the map BBOX to display it in parts.
What is the easiest way to do this? Thanks
That's a huge topic and depends very much on the use case. If you want to dive deeper into the topic, I would recommend this article. It describes the underlying theory and gives a working example of a SQL query. For me, it was sufficient to get going with a viable prototype. I think Postgres (with its PostGIS extension) would be a good choice, since the data can be served directly as MVT (Mapbox Vector Tiles).
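For illustration, a minimal sketch of serving one tile from PostGIS as MVT from Python (the table and column names are assumptions, and ST_TileEnvelope needs PostGIS 3.0+):

import psycopg2

# hypothetical table 'parcels' with a geometry column 'geom' in EPSG:3857
TILE_QUERY = """
SELECT ST_AsMVT(q, 'parcels') FROM (
    SELECT id,
           ST_AsMVTGeom(geom, ST_TileEnvelope(%(z)s, %(x)s, %(y)s)) AS geom
    FROM parcels
    WHERE geom && ST_TileEnvelope(%(z)s, %(x)s, %(y)s)
) AS q;
"""

def get_tile(conn, z, x, y):
    # returns the raw Mapbox Vector Tile bytes for tile z/x/y
    with conn.cursor() as cur:
        cur.execute(TILE_QUERY, {'z': z, 'x': x, 'y': y})
        return cur.fetchone()[0]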
I found this site very useful for displaying GeoJSON on the fly; it usually copes well with large data, and failing that you can import it too: https://geojson.io

Converting CSV into GRIB2 data for mapping in Leaflet

TL;DR: I'm looking for some resources on generating GRIB2 data sets on the fly, ideally using in-house-generated wind data in a CSV format.
We have a bunch of data for a series of localized weather stations monitoring wind information around our city. They report in at ~2-3 minute intervals (far more frequently than standard weather data), and from their reports we have lat, lon, wind speed, and wind direction. Someone went and told the boss about really slick visualizations, like this one, that can display wind speed and direction, and it's my job to make it happen.
The above plug-in for Leaflet (GitHub here), as well as several others, all use GRIB2 data, which from my research involves a left/right set of data and an up/down set of data (the U and V wind components) for a series of points plotted across a region.
The problem I'm having is that I've only found a handful of tools that interact with GRIB2 data, and most seem to decode GRIB2 data; only one tool, running on Fortran, seems to exist that encodes GRIB2 data.
So, is there any way to generate GRIB2 data on the fly using proprietary data at 2-3 minute intervals?
I've gone through this resource on NOAA's website, which is where I found a few tools.
I know how frustrating it can be to work with GRIB and some of the other science/weather-related formats. This may not be the best answer, but it might be your only answer, as I find these types of questions tend to gather dust because of the general lack of familiarity with the formats and tools.
From what I remember, the CDO tools (link here) can do some magical things, but I am not that experienced with them. I do use them for converting satellite data to plain text and they have been an absolute lifesaver! So I will explain:
My suggestion would be to first convert the CSV to netCDF. I had a link saved for this a long time ago, but never really came to need it (discussion here). Essentially, some Python code should be able to do the conversion for you; there may be several ways to do this, but I have never looked into it beyond initial research.
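As a rough illustration, here is a minimal sketch of that step with pandas and xarray (the column names are assumptions about the CSV layout):

import pandas as pd

# 'time', 'latitude' and 'longitude' are assumed column names in the CSV
df = pd.read_csv('stations.csv')
# indexing on the coordinate columns lets to_xarray() build real dimensions
ds = df.set_index(['time', 'latitude', 'longitude']).to_xarray()
ds.to_netcdf('stations.nc')  # needs the netCDF4 or scipy backend installed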
Next, you should be able to convert .nc to .grib using CDO. I know it can do quite a lot. Here is a discussion regarding this, so it must be possible.
I also see at this link that someone converts GRIB to netCDF, and you should be able to do it in reverse as well; I just don't know the exact commands. From the link:
As an example of use of CDO, converting from GRIB to netCDF can be as simple as
cdo -f nc copy file.grb file.nc
I would suspect it's just the reverse, probably something like:
cdo -f grb copy file.nc file.grb
(the copy operator is required, and -f grb2 should give GRIB2 rather than GRIB1).
Hopefully you can put things together for it to work without being too hack-y.
You can do this in a simple Python script using pandas, xarray and cfgrib:
import pandas as pd
from cfgrib.xarray_to_grib import to_grib  # cfgrib's (experimental) GRIB writer

data = pd.read_csv('your_csv_data.csv')
# index on the coordinate columns (names depend on your CSV) so that
# to_xarray() builds proper dimensions instead of a flat row index
xarray_data = data.set_index(['latitude', 'longitude']).to_xarray()
to_grib(xarray_data, 'out2.grib')
Please note that you have to define the GRIB specifications (the grid and the parameter's GRIB keys) on the dataset before you store it as GRIB data.
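For example, something along these lines, adapted from the kind of write-support usage cfgrib documents (the grid and the '10u' shortName for 10 m U wind are illustrative; cfgrib's write support is experimental):

import numpy as np
import xarray as xr
from cfgrib.xarray_to_grib import to_grib

# illustrative regular lat/lon grid; real values would come from your stations
ds = xr.DataArray(
    np.zeros((5, 6)),
    coords=[np.linspace(90., -90., 5), np.linspace(0., 360., 6, endpoint=False)],
    dims=['latitude', 'longitude'],
).to_dataset(name='u10')
ds.u10.attrs['GRIB_shortName'] = '10u'  # declare the GRIB parameter before writing
to_grib(ds, 'wind.grib')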

Early abandon discord searching in Time-Series using saxpy

I am trying to look for discords (the most unusual, least similar shapes) in a time-series data set. I came across this function in the saxpy package that outputs the discord shape. However, the link above is the only documentation that I could find, and the input parameters to the function haven't been explained very well there.
More specifically,
find_best_discord_brute_force(series, win_size, global_registry, z_threshold=0.01)
What do the parameters win_size and global_registry stand for?
Also, does the series parameter require me to input SAX words?
It would be great if someone could clear this up.
Thanks!
You should use the Matrix Profile instead. It is faster and simpler, and there is free code; see this presentation:
http://www.cs.ucr.edu/~eamonn/Matrix_Profile_Tutorial_Part1.pdf
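If you want to try the Matrix Profile from Python, here is a minimal sketch using stumpy, one freely available implementation (not the code the presentation refers to; the file name and window length are placeholders):

import numpy as np
import stumpy

series = np.loadtxt('your_series.txt')  # the raw numeric time series
m = 100                                 # subsequence length, analogous to win_size
mp = stumpy.stump(series, m)            # column 0 holds the matrix profile
# the discord is the subsequence farthest from its nearest neighbour
discord_start = int(np.argmax(mp[:, 0].astype(float)))
print('discord starts at index', discord_start)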
series is a numpy array of the data whose discords you are looking for.
win_size is the size of the sliding window used by sax_via_window for calculating the words representing your array.
Sorry, not sure what global_registry refers to.
For further information, there's documentation on GitHub: https://github.com/seninp/saxpy
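As a quick-start alternative to the brute-force function, the repository demonstrates discord discovery via HOT SAX; a minimal sketch along those lines (the file name and parameter values are placeholders):

import numpy as np
from saxpy.hotsax import find_discords_hotsax

series = np.loadtxt('ecg.txt')  # a plain numeric series - no SAX words needed
# returns (position, distance) pairs for the top discords
discords = find_discords_hotsax(series, win_size=100, num_discords=2)
print(discords)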

Find similar blocks of text

I'm trying to extract (long) repeating blocks of the same code, and have the regular flow of execution jump to/from those blocks when needed.
This is done over an educational assembly language called Hack.
How can I extract those blocks?
I understand it's an instance of the longest-common-subsequence problem, but I don't see how to approach this.
I have seen this, this, and a couple of others - no help.
Any pointers - be they a solution course, relevant reads, or even libraries - are much appreciated.
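For what it's worth, one hedged starting point (not a full solution): treat the program as a sequence of instruction lines and find the longest repeated run by comparing lexicographically adjacent suffixes. Occurrences may overlap, and real clone extraction needs more care, but it shows the shape of the problem:

def longest_repeated_block(lines):
    # sort suffix start positions; the longest repeated block is the longest
    # common prefix of two lexicographically adjacent suffixes
    suffixes = sorted(range(len(lines)), key=lambda i: lines[i:])
    best_len, best_start = 0, 0
    for a, b in zip(suffixes, suffixes[1:]):
        k = 0
        while a + k < len(lines) and b + k < len(lines) and lines[a + k] == lines[b + k]:
            k += 1
        if k > best_len:
            best_len, best_start = k, a
    return lines[best_start:best_start + best_len]

# toy Hack-assembly example: the repeated ['@2', 'D=A'] block is returned
print(longest_repeated_block(['@2', 'D=A', '@2', 'D=A', '0;JMP']))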

Extracting 'useful' information out of sentences?

I am currently trying to understand sentences of this form:
The problem was more with the set-top box than the television. Restarting the set-top box solved the problem.
I am totally new to Natural Language Processing and started using Python's NLTK package to get my hands dirty. However, I am wondering if someone could give me an overview of the high-level steps involved in achieving this.
What I am trying to do is identify what the problem was (in this case, the set-top box) and whether the action that was taken resolved the problem (in this case, yes, because restarting fixed it). If all the sentences were of this form, my life would be easier, but because it is natural language, the sentences could also be of the following form:
I took a look at the car and found nothing wrong with it. However, I suspect there is something wrong with the engine.
So in this case, the problem was with the car, the action taken did not resolve it (signalled by the word suspect), and the potential problem could be with the engine.
I am not looking for an absolute answer as I suspect this is very complex. What I am looking for is more rather a high-level overview that will point me in the right direction. If there is an easier/alternate way to do this, that is welcome as well.
Really, the best you could hope for is a naive Bayes classifier with a sufficiently large training set (probably larger than you have) and a willingness to tolerate a fair rate of false determinations.
Seeking the holy grail of NLP is bound to leave you somewhat unsatisfied.
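For concreteness, a toy sketch of the kind of classifier meant here, using NLTK (the feature names and labels are invented; a real feature extractor and a large labeled training set are the hard part):

import nltk

# toy hand-labeled examples with invented boolean features
train = [
    ({'contains_solved': True, 'contains_suspect': False}, 'resolved'),
    ({'contains_solved': False, 'contains_suspect': True}, 'unresolved'),
]
classifier = nltk.NaiveBayesClassifier.train(train)
print(classifier.classify({'contains_solved': True, 'contains_suspect': False}))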
If the sentences are well-formed, I would experiment with dependency parsing (http://nltk.googlecode.com/svn/trunk/doc/api/nltk.parse.malt.MaltParser-class.html#raw_parse). That gives you a graph of the constituents of a sentence, and you can tell the relations between the lexical items. Later, you can extract phrases from the output of a dependency parser (http://nltk.googlecode.com/svn/trunk/doc/book/ch08.html#code-cfg2). That could help you extract the direct object of a sentence, or the verb phrase in a sentence.
If you just want to get phrases or "chunks" from a sentence, you can try a chunk parser (http://nltk.googlecode.com/svn/trunk/doc/api/nltk.chunk-module.html). You can also carry out named entity recognition (http://streamhacker.com/2009/02/23/chunk-extraction-with-nltk/). It's usually used to extract instances of places, organizations or people's names, but it could work in your case as well.
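As a minimal chunking sketch with NLTK (the noun-phrase grammar is illustrative, and the punkt and averaged_perceptron_tagger data must be downloaded first):

import nltk

sentence = "Restarting the set-top box solved the problem."
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

# illustrative grammar: optional determiner, adjectives, then one or more nouns
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")
tree = chunker.parse(tagged)
for subtree in tree.subtrees(lambda t: t.label() == 'NP'):
    print(' '.join(word for word, tag in subtree.leaves()))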
Assuming that you solve the problem of extracting noun/verb phrases from a sentence, you may need to filter them out to ease the job of your domain expert (too many phrases could overwhelm a judge). You may carry out a frequency analysis on your phrases, remove very frequent ones that are not usually related to the problem domain, or compile a white-list and keep the phrases that contain a pre-defined set of words, etc.
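And a hedged sketch of that filtering idea (the toy phrases and the white-list are invented placeholders):

from collections import Counter

# phrases as produced by a chunking step; toy data for illustration
phrases = ['the set-top box', 'the problem', 'the set-top box', 'the television']
counts = Counter(phrases)
whitelist = {'box', 'engine', 'television'}  # hypothetical domain terms
keep = [p for p, c in counts.items()
        if any(w in p.split() for w in whitelist)]
print(keep)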