What to analyze and mine in news articles? - data-analysis

I am trying to learn some text analysis and intend to apply it to news articles, which I download and persist in a relational database along with the URL, keywords, start date, end date, headline, and description.
The trouble is that I don't know what to analyze, or what the practical use cases of such an analysis would be.

Related

How should I store similar entities - in one table or several?

I am creating a CV website, but unlike most of them I am trying to make it database-driven. I mean that usually such websites are static and all of the information is hard-coded in the HTML. Since I am a back-end developer, I like to make it so that everything, including buttons and welcome messages, is taken from the database. I am trying to store projects that I have worked on. There are several types:
GitHub repository - a project that is done purely on GitHub.
Work related - a project I have done at work; there is no GitHub repository for it, only a link to view the final result.
UpWork or another freelance website - as a freelancer I have projects to fix something on a website, and those projects can be viewed only on my profile there. I would like to list them with a link to UpWork or wherever there is information on what exactly I was hired to do.
Now my question is - should I have different entities, and therefore different tables, for these types of projects, or should I have all of the possible properties in one table? For example, if it is a GitHub project there is a repository field, and if it is work related then there is a company field. If it is freelance, it has a link to the website I was hired on. There are also different sub-types - web applications, desktop applications, games, and so on.
As you can guess, the differences are small (1 or 2 properties). I could very easily leave some properties empty and add another property, projectType, but is this the right way? Should I have different tables and entities for them?
To give some info - I can work with both MySQL and NoSQL, and I haven't decided yet which one my website should be built on. I am currently leaning towards NoSQL. This means I am asking how to store the projects in both MySQL and NoSQL (by NoSQL I mean MongoDB). If it helps, the languages I am choosing from are PHP (MySQL) and JavaScript (NoSQL).
I know that questions without code are usually downvoted, but this is more of a logic problem: I know how to do it, but I don't know the best practices for my situation. That being said, here is a small piece of code for you -
console.log('Thank you in advance')
MongoDB lends itself very well to this exact situation.
You can create a collection where documents leave out certain fields if they are not needed for that type. MongoDB's query operators let you check $exists on fields if you need to, and documents are stored efficiently, only taking up space for the fields they actually contain.
You can even set up a sparse index, which does not require the indexed field to be present in every document. As long as your core document structure is the same, it is a good idea to keep the projects in one collection and vary them based on their type.
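To illustrate, here is a minimal sketch of that single-collection approach using PyMongo; the database, collection, and field names are invented for the example:
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")
projects = client["cv_site"]["projects"]  # hypothetical database and collection names

# One collection; each document only carries the fields its type needs.
projects.insert_many([
    {"type": "github", "title": "CLI tool", "repository": "https://github.com/me/cli-tool"},
    {"type": "work", "title": "Internal CRM", "company": "Acme Corp", "demoUrl": "https://example.com/crm"},
    {"type": "freelance", "title": "Bug fixes", "profileUrl": "https://www.upwork.com/freelancers/~me"},
])

# Sparse index: only documents that actually have a "repository" field are indexed.
projects.create_index([("repository", ASCENDING)], sparse=True)

# Query by type, or by whether a field exists at all.
github_projects = list(projects.find({"type": "github"}))
with_repo = list(projects.find({"repository": {"$exists": True}}))

The MySQL equivalent is the single table you described, with nullable columns and a projectType column; the document model simply avoids storing the unused columns at all.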

How do I set up the architecture for a "big data" analysis project?

A friend of mine and I are in our senior year and will be starting a senior project soon. We had the idea to do a data analysis and data visualization project for it. Our project involves reading a CSV file that is updated every 2 minutes, parsing that data, then storing it in a database. Once that data is stored, we want to run some analysis on it and provide an API through which we could access that data to visualize it in some way. Our end goal would be to build an Android app that displays some of the raw data from the CSV and the analysis in a user-friendly format.
I talked to another CS major and he explained that I would need a few different servers to accomplish this: one for the storage, another for the analysis, and another for some type of queue that would make sure things don't get screwy while we are doing scraping and analysis. The problem is, I don't really know where to start with this. I've done some work with a SQL database and a PHP front end before, but nothing with multiple servers. I've heard of tools for big data projects like Hadoop, but I'm not exactly sure where it fits in. If someone could point me to a resource of some kind, or explain it themselves, how I would start to structure this kind of project, that would be awesome!
Since you don't have much experience with these things, you'll probably want to look at projects like Cloudera. Specifically, their resources page has a nice set of videos and articles.
Another source of solid information (one that I personally use) is clicking on a Stack Overflow tag and sorting by votes. Many good questions on a plethora of big data topics already exist.
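Before reaching for Hadoop or multiple servers, it may also help to see how small the core pipeline can start out. Here is a rough single-machine sketch of the ingest step, assuming a CSV feed with ts, sensor, and value columns; the URL, file name, and column names are all placeholders:
import csv
import sqlite3
import time
import urllib.request

CSV_URL = "https://example.com/feed.csv"  # placeholder for the file that updates every 2 minutes
DB_PATH = "project.db"

def ingest_once():
    # Download the CSV, parse it, and append the rows to a local database.
    with urllib.request.urlopen(CSV_URL) as response:
        rows = list(csv.DictReader(response.read().decode("utf-8").splitlines()))
    with sqlite3.connect(DB_PATH) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS readings (ts TEXT, sensor TEXT, value REAL)")
        conn.executemany(
            "INSERT INTO readings (ts, sensor, value) VALUES (:ts, :sensor, :value)",
            rows,
        )

if __name__ == "__main__":
    while True:              # a cron job or scheduler would be more robust than a sleep loop
        ingest_once()
        time.sleep(120)

The analysis job and a small HTTP API can read from the same database; separate servers and a message queue only become worth the trouble once a single process like this can no longer keep up.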

How to convert dimensional DB model to datamining friendly layout?

My problem is, I have a dimensional-model DB for an NFL league. We have Players, Teams, and Leagues as the dimension tables, and Match as the fact table that relates them. For instance, if I need to query the stats of a player in a particular match or a range of matches, it takes a very painstaking SQL query with lots of joins to convert the machine-readable, ID-based tables into a human-readable, name-based version. In addition, analysis of that data is also very painful. As a solution, I suggest transforming that DB into an analysis-friendly version. Again, for example, the Player table would include one player per row with the related stats, and the same for Teams as well.
The question is, is there any framework, method, or schema that might guide me in designing the analysis-friendly DB layout? Also, is SQL still the favorable choice, or is a non-SQL DB better for this problem?
I know it sounds like a very general question, but I just want to hear some expertise on the topic. Therefore, any help or suggestion is very welcome.
I was on a team that faced a similar situation about 13 years ago. We used a tool called "PowerPlay", a Business Intelligence tool from Cognos. This tool was very friendly to the data analysts, with drill-down capabilities and all sorts of name-based searching.
If I recall correctly (it's been a while), the BI tool stored the data in its own format (a data cube), but it had its own tool for automatically discovering the structure of an SQL-based data source. That automatic tool really struggled with the OLTP database, which was SQL (Oracle) and which was a real mess... a terrible relational design.
So what I ended up doing was building a star schema to collect and organize the same data, but in a form more compatible with a multidimensional view of it. I then built the ETL to load the star from the OLTP database. The BI tool cut through the star schema like a hot knife through butter.
And the analysts didn't have to mess with ID fields at all.
It sounds like your starting place is like the star schema I had to build. So I would suggest that there are BI tools out there that you can lay on top of your star, and they will provide precisely the kind of analyst-friendly environment you are looking for. Cognos is only one of many vendors of BI tools.
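To make the "analysis-friendly" shape concrete, here is a sketch that flattens ID-based dimension and fact tables into one wide, name-based view that analysts (or a BI tool) can work with directly; the table and column names are invented for the example:
import sqlite3
import pandas as pd

conn = sqlite3.connect("nfl_dw.db")  # placeholder database

# Dimension tables keyed by surrogate IDs, and a fact table that references them.
players = pd.read_sql("SELECT player_id, player_name FROM dim_player", conn)
teams = pd.read_sql("SELECT team_id, team_name FROM dim_team", conn)
facts = pd.read_sql("SELECT match_id, player_id, team_id, yards, touchdowns FROM fact_match_stats", conn)

# Denormalize: one human-readable row per (match, player), with names instead of IDs.
wide = (
    facts.merge(players, on="player_id")
         .merge(teams, on="team_id")
         .drop(columns=["player_id", "team_id"])
)

# Analysts can now aggregate without writing any joins.
per_player = wide.groupby("player_name")[["yards", "touchdowns"]].sum()
print(per_player.sort_values("yards", ascending=False).head())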
A few caveats: If you go this way, you have to make an effort to make sure your name fields "make sense" if they are going to provide meaningful guidance to the analysts trying to drill down or search. Sometimes original data sources treat name fields as more or less meaningless stuff, where errors don't matter much. The same goes for column names. Column names that DBAs like are often gibberish to data analysts. You may also have to flatten any hierarchical groupings in your dimension tables, but you may have already done this. It depends on what your BI tool needs.
Hope this helps, even if it's a little generic.

Best method to scrape large number of Wikipedia tables to MySQL database

What would be the best programmatic way to grab all the HTML tables from Wikipedia main article pages whose titles match certain keywords? I would then like to take the column names and table data and put them into a database.
I would also grab the URL and page name for attribution.
I don't need specifics, just some recommended methods or perhaps links to some tutorials.
The easy approach to this is not to scrape the Wikipedia website at all. All of the data, metadata, and associated media that form Wikipedia are available in structured formats, which removes any need to scrape their web pages.
To get the data from Wikipedia into your database (which you may then search, slice, and dice to your heart's content):
Download the data files.
Run the SQLize tool of your choice.
Run mysqlimport.
Drink a coffee.
The URL of the original article should be fairly easy to reconstruct from the page title.
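If you do decide to pull individual pages rather than work from the dumps (a different route from the one above), a rough sketch using the MediaWiki search API, pandas, and SQLAlchemy could look like the following; the keyword, database URL, and table naming are placeholders, and real pages need extra cleanup for merged headers and footnotes:
import requests
import pandas as pd
from sqlalchemy import create_engine

KEYWORD = "FIFA World Cup"                          # placeholder search term
engine = create_engine("sqlite:///wiki_tables.db")  # placeholder database

# Find article titles matching the keyword via the MediaWiki search API.
search = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={"action": "query", "list": "search", "srsearch": KEYWORD,
            "srlimit": 5, "format": "json"},
).json()

for hit in search["query"]["search"]:
    title = hit["title"]
    url = "https://en.wikipedia.org/wiki/" + title.replace(" ", "_")
    # read_html (needs lxml or html5lib installed) returns every <table> on the page.
    for i, table in enumerate(pd.read_html(url)):
        # Flatten multi-level headers so the columns survive the trip into SQL.
        table.columns = [" ".join(map(str, c)).strip() if isinstance(c, tuple) else str(c)
                         for c in table.columns]
        table["source_url"] = url    # keep the URL and page name for attribution
        table["page_title"] = title
        table.to_sql(f"{title[:40]}_{i}", engine, if_exists="replace", index=False)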

Full-text search in SQL Server (which Stack Overflow turned down)

My application is a help system (user assistance), just like the online MSDN, but the only way to navigate it is through search. Either the search is good or my system is dead.
I am looking for a third-party search engine that can connect to a database and provide out-of-the-box full-text searching.
I have researched SQL Server 2008 iFTS, the Lucene.NET API, and SQLite FTS4, but none of them rank results as well as Google does.
I'm not expecting something like Google, but I need the search engine product with the best ranking I can get.
Any suggestions or experience?
Maybe I should not go for a third-party search engine and should instead use Lucene.NET or SQL Server 2008 FTS, but then how can I establish good ranking for a user-provided search query like "how can i do upload excel file in XYZ interface"?
My short answer is discouraging: you won't be able to do it yourself, even for an "okay" solution.
If you want good ranking:
Make your site friendly to search engines (which doesn't necessarily mean that you have to open it to the public; just make sure search engines understand the URLs).
Pay Google to do it (look for Google Apps).
As you said, a search engine has to do at least two things. The first is indexing, i.e., finding the documents in the database that match the queried keywords. The second is ranking, which sorts all the documents and highlights the most relevant ones.
Ranking is one of the key factors in how good a search engine is. It's not surprising that ranking is hard.
To give you an idea of how hard it is, take the sentence in your question (i.e., "how can i do upload excel file in XYZ interface") as an example. A search engine has to answer at least two questions to get good results:
Which keywords are most important? For example, "XYZ" might be more important than the words "how" and "can".
What are the possible meanings of each word? "Excel" could be Microsoft Excel, or Xcel Energy (a company whose name sounds like "excel").
There is a whole field of computer science dedicated to this problem. If you want more evidence, take a quick look at ACM WWW.
One thing that is even more discouraging is that even an "okay" solution would be difficult to reach. The high-level point is that the computer knows nothing about English; it has to read a lot to learn how to rank documents.
Sadly, "a lot" means a lot of work. For example, many textbooks suggest ranking documents based on TF/IDF, but getting reasonable values for those statistics requires crawling millions of web pages.
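To make the TF/IDF idea concrete, here is a small ranking sketch using scikit-learn's TfidfVectorizer; the documents and query are invented, and a real help system would need far more text for the statistics to mean much:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Tiny made-up help corpus; real systems need much more text for useful IDF values.
documents = [
    "How to upload an Excel file in the XYZ interface",
    "Exporting reports to PDF from the dashboard",
    "Resetting your password in the admin console",
    "Importing CSV and Excel spreadsheets",
]

vectorizer = TfidfVectorizer(stop_words="english")
doc_vectors = vectorizer.fit_transform(documents)

query = "how can i do upload excel file in XYZ interface"
query_vector = vectorizer.transform([query])

# Rank documents by cosine similarity between the query and document vectors.
scores = cosine_similarity(query_vector, doc_vectors).ravel()
for score, doc in sorted(zip(scores, documents), reverse=True):
    print(f"{score:.3f}  {doc}")

Even this toy example runs into the two problems above: the stop-word list is what decides that "how" and "can" don't matter, and nothing here can tell Microsoft Excel apart from any other sense of "excel".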
To summarize:
Ranking is hard.
Therefore it's not surprising that you won't be able to find any free, out-of-the-box solutions, and Google and Microsoft keep their ranking algorithms proprietary.
If you want to rank documents in a large database, get a search engine.
Check out the new semantic search feature in SQL Server 2012:
http://msdn.microsoft.com/en-us/library/gg492075%28v=sql.110%29.aspx It won't be a silver bullet, but it might provide you with an "out of the box" approach.