I have three different Stata files (one for each of three years) and I want to estimate a fixed effects regression. My guess is that I need to combine those files in order to estimate my regression, but how do I do it? And how do I add a time identifier so that the same variables can be matched across years?
Typically, you don't merge such files (putting them side by side) but append them (putting them on top of one another). Usually the year or wave variable is already included, but when that is not the case you need to generate it before you append the files. For more, just type in Stata: help merge, help append, and help generate.
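For illustration, here is the same append-and-tag pattern as a minimal Python/pandas sketch (the file names and years are hypothetical; within Stata itself you would use generate and append as above):

    # Stack three yearly files and tag each row with its year
    import pandas as pd

    years = [2010, 2011, 2012]                    # assumed survey years
    frames = []
    for year in years:
        df = pd.read_stata(f"survey_{year}.dta")  # hypothetical file name
        df["year"] = year                         # generate the time identifier
        frames.append(df)

    panel = pd.concat(frames, ignore_index=True)  # append: put files on top of one another
    panel.to_stata("panel.dta", write_index=False)

The combined file then has one row per observation per year, which is the long format that fixed effects estimators expect.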
Preparing datasets should be fully documented, so using the GUI is not the way to do this. Instead, you should do it in a .do file. For a good introduction to doing good and reproducible research with Stata, see:
Long, J. S. (2009). The workflow of data analysis using Stata. College Station, TX: Stata Press.
Using the latest version of Visio, I created various flowcharts of the processes of a small but complex company (not done yet, but there will be approximately 8 Visio files, each with 3-6 sheets).
I am currently looking for a way to present the final result; my idea is to save those files as a website (VML). The problem, however, is that I want one single file, hence my question: how can I merge those files?
I tried applying my very limited HTML knowledge, but afterwards the site wouldn't open anymore. I tried "Microsoft Expression Web 4" and just copied 2 test files into it, but the result was not usable. My goal is to have a table of contents on the left side, linked to the actual Visio drawings (think: Visio file 1, sheets 1.1-1.5; Visio file 2, sheets 2.1-2.3, ...).
Thanks a lot for any help (or other ideas); I am going crazy over this!
It would be much easier to merge the drawings in Visio itself, before exporting to HTML. Just open the files side by side and drag the pages from one document to the other. You may need to hold down the Ctrl key so that the shapes are copied rather than moved.
Can machine learning be used to transform/modify a list of numbers?
I have many pairs of binary files read from vehicle ECUs: an original or "stock" file from before the vehicle was tuned, and a modified file in which the engine parameters have been altered. The files are basically lists of little- or big-endian 16-bit numbers.
I was wondering if it is at all possible to feed these pairs of files into a machine learning system, and for it then to take a new stock file and attempt to transform or tune it.
I would appreciate it if somebody could tell me whether this is at all possible. All of the examples I've found appear to make decisions on data rather than perform any sort of transformation.
Also, I'm hoping to use Azure for this.
We would need more information about your specific problem to answer properly. But supervised machine learning can take data with a lot of inputs (like your stock file, perhaps) and an output (say, a tuned value), learn the correlations between those inputs and the output, and then predict the output for new inputs. (In machine learning terminology, the inputs are called "features" and the output is called a "label".)
Now, within supervised machine learning, there is a category of algorithms called regression algorithms. Regression algorithms allow you to predict a number (sounds like what you want).
The issue that I see, if I'm understanding your problem correctly, is that you have a whole list of values to tune. Two things:
Do those values depend on each other and influence each other? Do any other factors not included in your stock file affect how the numbers should be tuned? Those will need to be included as features in your model.
Regression algorithms predict a single value, so you would need to build a model for each of the values in your stock file that you want to tune (a sketch follows).
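To make that second point concrete, here is a minimal scikit-learn sketch, assuming you have already decoded each binary file into a fixed-length array of 16-bit values. The file names, shapes, and the choice of linear regression are all assumptions for illustration, not a recommendation:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # X: one row per vehicle, columns = decoded stock parameters
    # Y: same shape, the corresponding tuned parameters
    X = np.load("stock_params.npy")   # hypothetical, shape (n_files, n_params)
    Y = np.load("tuned_params.npy")   # hypothetical, shape (n_files, n_params)

    # One regression model per parameter you want to tune
    models = [LinearRegression().fit(X, Y[:, j]) for j in range(Y.shape[1])]

    # Predict a full tuned file for a new stock file
    new_stock = X[0].reshape(1, -1)   # placeholder for a newly decoded stock file
    tuned = np.array([m.predict(new_stock)[0] for m in models])

Azure Machine Learning can run a Python script like this as part of an experiment, or you can build the equivalent per-value regression models with its built-in modules.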
For more information, you might want to check out Choosing an Azure Machine Learning Algorithm and How to choose algorithms for Microsoft Azure Machine Learning.
Again, I would need to know more about your data to make better suggestions, but I hope that helps.
As part of my Master's thesis, I'm trying to run some statistics on which factors affect whether crowdfunding campaigns get funded or not. I've been trying to get data from the largest platform Kickstarter.com. Unfortunately, they have removed all the non-successful campaigns from their website (unless you have the direct link).
Luckily, I'm not the only one looking for this data.
Webrobots.io have a scraper robot which crawls all Kickstarter projects and collects data in JSON format (http://webrobots.io/kickstarter-datasets/).
The latest dataset can be found on:
http://webrobots.io/wp-content/uploads/2015/10/Kickstarter_2015-10-22.json_.zip
However, my programming skills are limited, and I don't know how to convert it into an Excel file where I can manipulate the data and run my analysis. I found a few online converters, but the file is far too big for them (approx. 300 MB).
Can someone please help me get the file converted?
It will earn you an acknowledgement in my Master's thesis when it gets published :)
Thanks in advance!!!
I guess the answer to this depends massively on a few things.
What subject is the Master's covering? (Mainly to appease the many people who will probably assume you're hoping for others to do your homework for you! This might explain why the thread has been down-voted already.)
You mention your programming skills are limited... What programming skills do you have? What language would you be using to achieve this goal? Bear in mind that even with a fully coded solution, if it's not in the language you know, you might not be able to compile it!
What kind of information do you want from the JSON file?
With regard to question 3, I've looked in the JSON file and it contains hierarchical data, which is pretty difficult to replicate in a flat file, i.e. an Excel or CSV file (I should know, we had to do this a lot in a previous job of mine).
But I would look at the following plan of action to achieve what you're after:
Use a JSON parser to deserialize the data into a class structure (Visual Studio can create the classes for you; see this S/O thread: How to show the "paste Json class" in visual studio 2012 when clicking on Paste Special?)
Once you've got the objects in memory, you can step through them one by one, pick out the data you want, append it to a comma-separated string (in C# I'd use the StringBuilder class), and write the rows of data out to a file on disk.
Once this is complete, you'll have the data you want.
Depending on what data you want from the JSON file, step 2 could be the most difficult part, as you'd need to step into the different levels of the data hierarchy.
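If C# turns out to be a hurdle, the same three steps can be sketched in a few lines of Python. The record layout and field names below ("projects", "name", "goal", "state") are guesses, so check them against the actual structure of the Webrobots dump:

    import csv
    import json

    with open("Kickstarter_2015-10-22.json", encoding="utf-8") as f, \
         open("projects.csv", "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["name", "goal", "state"])        # header row
        for line in f:                                    # assuming one JSON record per line
            record = json.loads(line)
            for project in record.get("projects", []):    # hypothetical nesting
                writer.writerow([project.get("name"),
                                 project.get("goal"),
                                 project.get("state")])

The resulting projects.csv opens directly in Excel, and streaming line by line avoids loading the whole 300 MB file into memory at once.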
Hope this points you in the right direction.
You may want to look at this blog post:
http://jdunkerley.co.uk/2015/09/04/downloading-and-parsing-met-office-historic-station-data-with-alteryx/
He uses a process with Alteryx that may line up with what you are trying to do. I am looking to do something similar, but haven't tried it yet. I'll update this answer if I get it to work.
Is it possible to display only significant p-values and/or r-values in the output of SPSS?
This would simplify the output significantly and reduce the tables to display only the relevant parts (the ones I need).
I'm not sure that this is a good idea, but if you want to do things such as highlighting significant coefficients in a regression or blanking out nonsignificant correlations in a correlation matrix, the SPSSINC MODIFY OUTPUT extension command can do this. It is included in the Python Essentials for SPSS Statistics and can be downloaded from the SPSS Community site at www.ibm.com/developerworks/spssdevcentral or, for V21, from the same site where Statistics is available for download, or from the trial site.
I agree that there are many cases where this is not a good idea.
In general, I find post-processing of SPSS output tables to be a little bit awkward. This is one area in which R is a lot easier to use.
For occasional analyses I often find it useful to paste an SPSS output table into Excel for further processing. For example, you could sort columns by size (e.g., mean difference, p-value, r, etc.), calculate new values (e.g., mean differences, absolute correlations, etc.), make the table easier to read, and so on.
I'm looking to identify some possible software options that will allow custom rules for manipulating bulk data files (.csv). For example: proper capitalization (allowing state abbreviations to remain capitalized and handling unique surnames), counting occurrences of specific words in a field, and some other custom rules. Any guidance would be appreciated.
You could use Talend Open Studio for this task. It is an open-source ETL tool for data manipulation and integration. You can, for example, go ImportCSV >> DATABASE >> perform transformations >> ExportCSV. The possibilities are endless.
You can find it here: http://www.talend.com/products-data-integration/talend-open-studio.php
It also sounds like you might be looking to create a profile of the data. For this you can use Talend Open Profiler, they recently added support for flat files such as your .csv. It is simple to use and you should be up and running in 30 mins.
You can find the download here: http://www.talend.com/products-data-quality/talend-open-profiler.php
You can find some tutorials here: http://www.talendforge.org/tutorials/menu.php
On the tutorials page, choose the Data Quality tab and scroll down to 'Talend Open Profiler'.
It is my first step in assessing data quality on a new dataset.
A quick Google search for "data scrubbing utilities" turned up this:
http://data-scrubbing.qarchive.org/
They look to be very close to what you're looking for.
It'll really depend on how complex the rules get. If they're much more complex than the simple stuff, you'd probably be ahead by just coding something up (or having it coded); a rough sketch follows.
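As an illustration of how little code the simple rules need, here is a rough Python sketch. The column names ("name", "notes"), the exception lists, and the counted word are all made up for the example:

    import csv

    KEEP_UPPER = {"TX", "NY", "CA"}    # e.g. state codes that should stay capitalized
    SPECIAL_SURNAMES = {"mcdonald": "McDonald", "o'brien": "O'Brien"}

    def proper_case(value):
        words = []
        for w in value.split():
            if w.upper() in KEEP_UPPER:
                words.append(w.upper())
            else:
                words.append(SPECIAL_SURNAMES.get(w.lower(), w.capitalize()))
        return " ".join(words)

    with open("input.csv", newline="") as f, open("output.csv", "w", newline="") as out:
        reader = csv.DictReader(f)
        writer = csv.DictWriter(out, fieldnames=reader.fieldnames + ["word_count"])
        writer.writeheader()
        for row in reader:
            row["name"] = proper_case(row["name"])  # hypothetical column
            row["word_count"] = row["notes"].lower().split().count("warranty")  # count one word
            writer.writerow(row)

Past a handful of rules like these, the coded route also makes the rules testable and repeatable, which the one-off GUI tools generally don't.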