Web usage mining with rapidminer - rapidminer

I'm new to RapidMiner and am currently doing research on web usage mining. I want to analyze some Apache and IIS web server logs and detect fraudulent activity. I have googled but couldn't find any tutorials for this kind of web log file mining using RapidMiner.
So my questions:
1) Is it possible to do this with RapidMiner? (As far as I know, it has a web mining extension.)
2) Can somebody please advise me on how to do this, e.g. point me to some tutorials?
Thanks very much in advance.

This was asked in the rapidforum:
Yes, the RapidMiner 4.6 Community Edition together with its text mining plugin is suitable for web usage mining. The RapidMiner 4.6 operator LogFileSource lets you import web server log files directly. RapidMiner supports aggregations of web usage statistics, automated web page visitor session extraction, search robot filtering, mash-ups with web services to map IP addresses to countries, cities, and map coordinates, automated clustering of visits and/or click paths, frequent path item set mining and association rule generation, 2D and 3D visualization of web usage statistics, click path sequence analysis, personalized product recommendations for cross-selling, and many other things.
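To make two of those steps concrete (session extraction and search robot filtering), here is a minimal sketch in plain Python rather than RapidMiner. It assumes Apache "combined" format log lines (IIS logs use a different layout), a crude user-agent keyword check for robots, and a 30-minute inactivity gap between sessions. The sample lines, the keyword list and the gap are all assumptions chosen for illustration, not anything RapidMiner prescribes.

```python
import re
from datetime import datetime, timedelta

# Apache "combined" log format; the referrer/user-agent tail is optional so
# that "common" format lines also parse.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)
BOT_HINTS = ("bot", "crawler", "spider")  # assumption: crude keyword heuristic
SESSION_GAP = timedelta(minutes=30)       # assumption: common convention


def parse_line(line):
    """Return a dict for one log line, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    return record


def is_robot(record):
    """Flag requests whose user agent contains a known robot keyword."""
    agent = (record.get("agent") or "").lower()
    return any(hint in agent for hint in BOT_HINTS)


def sessionize(records):
    """Group requests per IP into sessions split by inactivity gaps."""
    sessions = {}
    for rec in sorted(records, key=lambda r: r["time"]):
        visits = sessions.setdefault(rec["ip"], [])
        if visits and rec["time"] - visits[-1][-1]["time"] <= SESSION_GAP:
            visits[-1].append(rec)   # continues the visitor's current session
        else:
            visits.append([rec])     # first request, or gap too long
    return sessions


# Made-up sample lines, for illustration only.
SAMPLE = [
    '10.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0"',
    '10.0.0.1 - - [10/Oct/2023:14:00:00 +0000] "GET /cart HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '10.0.0.1 - - [10/Oct/2023:15:30:00 +0000] "GET /login HTTP/1.1" 401 128 "-" "Mozilla/5.0"',
    '10.0.0.2 - - [10/Oct/2023:13:56:00 +0000] "GET /robots.txt HTTP/1.1" 200 64 "-" "Googlebot/2.1"',
]

records = [r for r in map(parse_line, SAMPLE) if r and not is_robot(r)]
sessions = sessionize(records)
```

Per-session features such as request counts or the share of 401 responses could then feed whatever fraud detection you build on top, in RapidMiner or elsewhere.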


Considering Tyk API Gateway - open source version

Project background: Building an API driven Learning Management System. The back-end system will be receiving data from multiple systems and interfaces: web, mobile, VR.
Looking at API gateways to front our APIs. Preferably an open source API gateway, but we need to be sure that support and service are available. Tried out Tyk.io and it feels like it might be the way to go. I've been reading other StackOverflow threads about this, and it looks like Tyk's gateway fares quite well against the likes of Kong and WSO2.
Main areas of consideration for us are:
Rate-limiting
Open ID Connect authentication
Analytics
Scalability
Hybrid model of hosting - a combination of on-prem and cloud depending on the compliance requirements of educational institutes (this probably rules out AWS' gateway)
It would be really helpful if anyone who is using or has used TYK.io for their production projects can share their experience, especially for enterprise clients/projects.
Full disclosure: I work for Tyk, so of course think that Tyk is the best fit for your project ;)
Seriously, though - Tyk can do all those things you’re after. Here are some links to the documentation for each item that is big on your list:
Rate-limiting
Open ID Connect authentication
Analytics
Scalability
Hybrid model of hosting
You can also post on the Tyk community for help, if you haven’t already, or search to see what else others have said.
The Tyk Open Source API Gateway will do everything you need, even outputting analytics to different sinks, like ElasticSearch, Mongo or just CSV.
In addition, you can also use our API Management Platform to control your open source gateway. The Tyk API Management platform includes a Dashboard with analytics and out-of-the-box developer portal. Tyk is free to use, under a developer license, to manage a single gateway node, ideal if you are doing a POC.
Hope this helps and please keep in touch to let us know more about your use case.

Additional Tutorials or worked examples of best practice for configuring multi vm projects in google compute engine

I was hoping people would know of more samples and best practice guides for configuring systems on google compute engine so I can gain more experience in deploying them and apply the knowledge to my own projects.
I had a look at https://developers.google.com/compute/docs/samples-and-videos#samples, which runs through deploying a Cassandra cluster and Hadoop using scripts, but I was hoping there might be more available, including on the following topics:
Load balancing webservers across zones, with samples including configuring networking, firewalls and the load balancer.
Fronting tomcat servers with apache behind a load balancer
Multi network systems in compute engine using subnetting
Multi project systems and how to structure them for reliability and secure interoperability.
Ideally these would be easy-to-follow projects that you build starting from a blank project and that end up with a sample site running across multiple VMs and zones with recommended security in place. Something like the videos you see for GAE coding examples that go from hello world to something more complex, but for infrastructure rather than code.
Does anyone know of any?
You may want to check out https://cloud.google.com/developers/#resources for tutorials and samples, as well as http://googlecloudplatform.github.io
I'm new to the forums so I can only post two links. Taking a quick look I see several topics that may be of interest to you:
Managing Hadoop Clusters on Compute Engine
Auto Scaling on the Google Cloud Platform
Apache Hadoop, Hive, and Pig on Google Compute Engine
Compute Engine Load Balancing in Action
I hope this helps!

Is BizTalk The Correct Solution?

We have about six systems (they are all internal systems) that we need to send data between. Currently we do not have a consistent way of doing this. We use SSIS, SQL Server linked servers to directly update databases, ODBC connections to directly update databases, text files, etc.
Our goals are:
1) Have a consistent way of connecting applications.
2) Have a central way of monitoring and logging the connections between applications.
3) For the applications that offer web services, we would like to start using them instead of connecting directly to the database.
Whatever we use will need to be able to connect to web services, databases, flat files, and should also be able to accept data via a tcp connection.
Is BizTalk a good solution for this, or is it overkill?
It really depends. For the architecture you're describing, it would seem a good fit. However, you will need to validate whether BizTalk can communicate with the systems you are trying to integrate. For example, when these systems use web services, message queues or file-based communication, it may be a good fit.
When you start with BizTalk, you have to be willing to invest in hardware, software, and most of all in learning to use it.
Regarding your points:
1) yes, if you make sure to encapsulate the system connectors correctly
2) yes, biztalk supports this with BAM
3) yes, that would match perfectly
From what you've described (6 systems), it is definitely a good time to investigate a more formalized approach to integration. As you've no doubt found, a point-to-point / direct integration approach results in a large number of permutations (spaghetti) as each new system is added.
BizTalk supports both hub and spoke, and bus type topologies (with the ESB toolkit), either of which will reduce the number of interconnects between your systems.
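As a rough illustration of why the interconnect count matters, the arithmetic can be sketched as below. This assumes the worst case for point-to-point, where every system may need to talk to every other one, and counts each pair once; a hub or bus needs one connection per system.

```python
def point_to_point_links(n):
    """Distinct pairs among n systems when each may talk to every other."""
    return n * (n - 1) // 2


def hub_links(n):
    """One connection per system to a central hub or bus."""
    return n


for n in (6, 7, 10):
    print(f"{n} systems: {point_to_point_links(n)} direct links "
          f"vs {hub_links(n)} hub connections")
```

For the six systems in the question that is up to 15 direct links against 6 hub connections, and adding a seventh system adds up to 6 new direct links but only 1 hub connection.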
To add to oɔɯǝɹ:
Yes - ultimately BizTalk converts everything to XML internally and you will use either visual maps or xslt to transform between message types.
Yes. Out of the box there are a lot of WMI and Perfmon counters you can use, plus BizTalk has a SCOM management pack to monitor BizTalk's health. For your apps, there is BAM (TPE for simple monitoring; more advanced stuff can be done with the BAM API).
Yes - BizTalk supports all the common WCF binding types, and basic SOAP web services. BizTalk's messagebox can be used as a pub / sub engine which can allow you to 'hook' other processes into messages at a later stage.
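For anyone unfamiliar with the pub / sub idea, here is a toy in-process sketch of the pattern. It is not BizTalk's API and says nothing about how the messagebox is implemented; the `MessageBus` class and the message type names are made up for illustration. The point is only that publishers don't know who consumes a message, so a new consumer can be hooked in later without touching the sender.

```python
from collections import defaultdict


class MessageBus:
    """Toy publish/subscribe hub: handlers register per message type."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, message_type, handler):
        self._subscribers[message_type].append(handler)

    def publish(self, message_type, payload):
        """Deliver payload to every current subscriber; return how many."""
        handlers = list(self._subscribers.get(message_type, []))
        for handler in handlers:
            handler(payload)
        return len(handlers)


bus = MessageBus()
billing_inbox, audit_inbox = [], []

bus.subscribe("OrderCreated", billing_inbox.append)
bus.publish("OrderCreated", {"order_id": 1})

# A second process is hooked in later; the publisher does not change.
bus.subscribe("OrderCreated", audit_inbox.append)
bus.publish("OrderCreated", {"order_id": 2})
```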
Some caveats:
. BizTalk should be used for messages (e.g. Electronic Documents across the organisation), but not for bulk data synchronisation. SSIS is a better bet for really large data transfers / data migration / data synchronisation patterns.
. As David points out, there is a steep learning curve to BizTalk and the tool itself isn't free (requiring SQL and BizTalk licenses, and usually you will want to use a monitoring tool like SCOM as well.). To fast track this, you would need to send devs on BizTalk training, or bring in a BizTalk consultant.
. Microsoft seem to be focusing on Azure Service Bus, and there is speculation that BizTalk is going to be merged into Azure Service Bus at some point in the future. If your enterprise strategy isn't entirely Microsoft, you might also want to consider products like NServiceBus and FUSE for an ESB.
Your problem is a typical enterprise problem. Companies start off building isolated applications (HR, web, supply chain, inventory, client management, etc.) over a number of years. Once they reach a point where these applications can no longer live alone and need to talk to each other, they typically start with some hacked solution like data migration at the database level.
But very soon they run into problems like no clear visibility, poor management and no standards, and they have created real spaghetti. The biggest threat is that applications become dependent on one another and you lose your agility to change anything. Any change to a system will require heavy testing and a long release cycle.
This is the kind of problem a middleware platform like BizTalk Server will solve for you. A lot of replies in the thread focused on the cost of BizTalk Server (some of the costs mentioned are not correct, BTW). It's not a cheap product, but the cost factor soon diminishes if you look at the role it plays in your organisation as a central middleware platform connecting all the applications together, and at the number of non-functional benefits you get out of the box: adapters for most third-party products and protocols such as SAP, Oracle, FTP, FILE and web services; the ability to scale your platform easily; performance; durability; long-running workflows with compensation logic; throttling of your environment; and so on.
My recommendation is to take a look at BizTalk. If you are new to it, engage with your local Microsoft office: either they can help, or they can recommend a partner who can come and analyse your situation.

Arcgis explained? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 6 years ago.
I want to look into ArcGIS, and I can't get my head around where it fits in.
I have used the Google Maps API to create some simple maps with markers, overlays, listeners, etc.
I have recently started looking at PostGIS, and I fully understand that it enhances Postgres with additional data types and functions for drawing polygons and mapping areas. Great!
What I don't understand is where ArcGIS fits in.
What does it do? Why would you use it?
I have a large db of addresses.
Ultimately, it comes down to whether you are happier having a big software stack where everything is designed to work together or whether you are happier doing a bit of SQL, and/or Javascript and Python coding (these are the big players in open source GIS), and generally piecing bits together yourself. ESRI, the makers of the ArcGIS family (which includes desktop, server and web based technologies) is essentially the Microsoft of the GIS world -- the big player, whose products are designed to work very well with each other, but are sometimes a bit tardy when it comes to standards compliance or interaction with 3rd party software.
On the open source side, PostGIS, which essentially provides a spatial type extension to Postgres plus many spatial functions, is really an amalgam of various packages: GEOS, which provides many of the spatial predicate functions, Proj4, which does coordinate system conversion, and GDAL, which provides a lot of glue functions. Recently PostGIS has added native support for raster, 3D and topology functions, along with the long-existing vector functions, which means that amazingly sophisticated GIS analysis can be performed directly at the database level by chaining together SQL functions.
As has been suggested above, QuantumGIS (generally known as QGIS) provides most of the functionality of ArcGIS Desktop, for those wanting to go that route. JavaScript libraries such as OpenLayers or Leaflet (there are many others) can be used to visualize the results of PostGIS queries. In addition, there are tools such as GeoServer (Java servlet based) which allow you to serve up data held either in Postgres/PostGIS tables or ESRI shp files in common formats such as WMS, WFS and WMTS, acting as a bridge between client and server.
The final decision is often as much political as technical. If you work for a large utility company or local government, for example, support contracts and industry norms are likely to outweigh budgetary constraints. If you are working for a small startup and have people who are happy working at the command line, you will likely get much better value for money from the open source stack.
I assume you are asking about ArcMap, which is ESRI's desktop GIS application. The application is an extremely powerful cartographic production and analytic tool that allows the user to perform complex spatial operations. ArcMap also provides a graphical user interface that allows users to edit geometry supporting complex topologies that model real-world phenomena. The application also supports a wide range of raster analysis that is commonly used for remote sensing.
Some examples of how you could use ArcGIS with your database of addresses:
You can use it to compare addresses to census data and better understand customers.
Generate heat maps of statistically significant clusters.
Use network analysis tools to identify closest facilities or route you to the addresses in the most efficient order.
Find what types of features your data falls within (i.e. city council districts, fire districts, states, etc.) and update your data with that information.
These are just a few things you could do with ArcGIS. In short, this tool allows you to view and analyze your data in a spatial context versus the relational approach you seem to be taking now.
ESRI's ArcGIS is very powerful and has TONS of customization options through their ArcObjects API as well as a new way to add your own custom tools and button commands through a framework they call Add-ins. You can even use Python to create a very simple (code-wise) tool that lets a user click a button and, for example, return a selection on the map of all the telephone poles that are within 50 feet of a tree line. They could then just export this set of pole features as a tabular data report for a tree trimming crew to visit each pole to see if they need to trim back the vegetation. You can also use their ArcGIS Runtime to build a completely custom tool that runs from a USB thumb drive with zero install with only the parts and pieces you create like a Map, Table of contents, and a custom toolbar that has only the buttons and tools you need specific for that application. I've seen a gas utility inspection application written this way that only had the map and three buttons for them to use on an iPad or Android tablet. The options with ArcGIS are very near to endless and they keep updating it all constantly.
My day job is customizing ArcGIS to fit gas, water, and power utilities' needs to match their business workflow. I have been working with ArcGIS since 2004.
If you understand the Google Maps API and PostGIS then you really have no need for ArcGIS. Download QGIS and use it in conjunction with PostGIS: http://www.qgis.org/
ArcGIS is just unnecessary; that's why it doesn't make sense to you (and rightfully so).
Simple Answer: It lets you use MAPS to analyze and store data in a database, where said data has some sort of 'location' attribute on a surface or in 3D space.
Here's a quick example: "Return all of the parcels in Smith County that are within 1000 feet of a school and display them in red on a map."
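To show what that kind of query boils down to, here is a deliberately simplified sketch in plain Python. It assumes parcels and schools are single points in a planar coordinate system measured in feet, and all names and coordinates are made up. A real GIS such as ArcGIS or PostGIS works with parcel polygons, projections and spatial indexes, none of which are modelled here.

```python
import math


def within_distance(parcels, schools, max_feet):
    """Return names of parcels whose point lies within max_feet of any school."""
    hits = []
    for name, (px, py) in parcels.items():
        if any(math.hypot(px - sx, py - sy) <= max_feet for sx, sy in schools):
            hits.append(name)
    return hits


# Hypothetical data: coordinates in feet on a flat local grid.
parcels = {"A": (0, 0), "B": (900, 300), "C": (5000, 5000)}
schools = [(600, 0)]

near_school = within_distance(parcels, schools, 1000)
```

Here parcels A and B fall inside the 1000-foot radius and C does not; the "display them in red" part is what the mapping side of a GIS then adds.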
If I had to answer in one sentence, it would be: if you just want to show where something is on the map (and some basic data with it), use the Google Maps API, but if you want to analyze, query and understand your spatial data, use ArcGIS.
ArcGIS is a platform comprising Desktop, Server and Portal (a spatial CMS), with various types of geodatabases supported. ArcGIS for Desktop is used for powerful spatial analysis; it includes more than 700 different tools that support strong spatial and alphanumeric analysis. Spatial analysis covers spatial overlays (simple example: where do both wolves and foxes live), proximity analysis (factory-to-customer distances; a protected buffer area around an oil drill) and spatial statistics (finding patterns in space and time, mapping clusters such as hot and cold spots). Also, since a database sits behind every serious GIS, you can use SQL to query your alphanumeric (and spatial) data to make better decisions.
Mapping is also a function of ArcGIS Desktop software: our brains understand information much better when it is visualized, and you can and should visualize the results you obtain through analysis. Keep in mind that a map is only a visualization of the data in a geodatabase (or shapefile).
ArcGIS Desktop is also used for data entry with “heads up” editing, for example creating vectors with attributes from orthophoto images.
Geodatabase management is also part of ArcGIS, and geodatabases vary from file geodatabases to enterprise geodatabases which use SQL Server, Oracle, DB2 and other RDBMSs. A single-user file geodatabase supports one concurrent editor and has no storage limit, while enterprise geodatabases provide multiuser editing, versioning, archiving and backup scenarios. A personal geodatabase is a single-user geodatabase that uses Microsoft Access to store spatial data.
ArcGIS for Server provides different formats of spatial services containing spatial data (a map) along with alphanumeric information (if supported by the format). Types of ArcGIS for Server services that can be published include: Mapping, WCS, WMS, Feature Access, Schematics, Mobile Data Access, Network Analysis, KML, WFS… ArcGIS for Server services are authored using ArcMap and served with Server (of course). Their URLs are used by developers who code from scratch, or within Portal for ArcGIS web app templates (which developers can customize if needed), or in other Esri viewers such as the Silverlight or Flex ones.
I would say that if you are already comfortable with PostGIS, you should be fine for any work with vectors. If you are working with raster data then I think that would be where ArcGIS would fit in. In ArcGIS you can run different types of statistics and filters on rasters where I don't think you can with PostGIS but I'm sure that will eventually be added.
One more thing, if you ever need to automate your PostGIS work, I would recommend using Python with the psycopg2 library.
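A minimal sketch of what that automation might look like. The table names (`parcels`, `schools`), the column names (`id`, `name`, `geom`) and the connection string are hypothetical; `ST_DWithin` is a standard PostGIS function, and for geometry columns its distance is in the units of the layer's spatial reference system.

```python
# Hypothetical schema: parcels(id, geom) and schools(name, geom).
NEARBY_SQL = """
    SELECT p.id
    FROM parcels AS p
    JOIN schools AS s
      ON ST_DWithin(p.geom, s.geom, %(distance)s)
    WHERE s.name = %(school)s
    ORDER BY p.id
"""


def fetch_nearby_parcels(dsn, school, distance):
    """Return ids of parcels within `distance` SRS units of the named school."""
    import psycopg2  # imported lazily so this module loads without the driver

    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            # Values are passed as parameters, never formatted into the SQL.
            cur.execute(NEARBY_SQL, {"school": school, "distance": distance})
            return [row[0] for row in cur.fetchall()]
    finally:
        conn.close()
```

Usage would be something like `fetch_nearby_parcels("dbname=gis user=me", "Smith Elementary", 1000)`, with the connection string and school name adjusted to your own database.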

Should I code for browser or PC? (fleet management)

I have to architect a commercial vehicle fleet tracking system.
Each vehicle (a few 100, max a few 1,000) will have a GPS and satellite transmitter and will periodically report its position. Positions will be stored in a database and used to create a Google Map.
There will of course be other functionality: security, login, etc., and probably lots of interaction with other corporate databases (drivers' start/stop times for salary purposes, etc.).
Question: pure Google Maps is probably best implemented as a browser-based app (PHP & MySQL?), but with the additional functionality of a commercial vehicle fleet tracking system, would it be better to do something PC based (Windows/Linux)?
Any other advice? Thanks
I think that with the capabilities of modern browsers, along with various mature client-side frameworks, we are witnessing an ever-thinning distinction between web and desktop interfaces.
You may want to take into consideration that a web application automatically solves some important problems for you:
Distribution: No need to distribute your application. Simply provide a URL.
Updates: Upgrading and fixing problems in your software will be easier and quicker if you distribute it through a web interface.
Security: Deriving from the above, you are able to fix security vulnerabilities more promptly.
Compatibility: Your application will be able to work on any operating system that can launch a web browser.
Last but not least, remember that the Google Maps API is not free for this type of application. Article 10.9.C of the Google Maps API Terms and Conditions explicitly restricts using the standard Google Maps API for fleet management and asset tracking. You would need the Google Maps API Premier to legally use Google Maps for your application.
According to one unofficial source (dated April 2008), this would cost USD 10,000 per year, which entitles you to track 100 vehicles. If you exceed the 100 vehicles, you would need to add USD 24 per additional vehicle per year.
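To put those figures against the fleet sizes in the question, the arithmetic can be sketched as follows. The numbers are just the ones quoted from that unofficial 2008 source, so treat the output as an estimate under that assumption, not as current pricing.

```python
BASE_FEE_USD = 10_000        # per year, as quoted by the unofficial source
INCLUDED_VEHICLES = 100      # vehicles covered by the base fee
EXTRA_VEHICLE_FEE_USD = 24   # per additional vehicle per year


def annual_license_cost(vehicles):
    """Estimated yearly cost in USD for a fleet of the given size."""
    extra = max(0, vehicles - INCLUDED_VEHICLES)
    return BASE_FEE_USD + extra * EXTRA_VEHICLE_FEE_USD


for fleet in (100, 300, 1000):
    print(fleet, annual_license_cost(fleet))
```

Under these assumptions a 300-vehicle fleet comes to USD 14,800 per year and a 1,000-vehicle fleet to USD 31,600.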
Implement a solution for the domain problems first. That means data storage, data transmission between the vehicles and your system, and methods of data analysis, aggregation and visualisation.
These will likely sit as a headless system on a server that is accessed remotely in both directions: to input data and to query data.
Now, PC or web is more a matter of presentation on the client side. You can build both if you like: a web client as well as a desktop application can serve as a client to the remote data and operations server.
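As a rough sketch of that headless core, the storage and query side might start out like this. SQLite is used here only so the example runs anywhere; the table layout, function names and sample coordinates are assumptions, and a real system would put a network API in front of it for both the vehicles and the clients.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (and if needed create) the position store."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS positions (
               vehicle_id  TEXT NOT NULL,
               reported_at TEXT NOT NULL,  -- ISO 8601 UTC timestamp
               lat         REAL NOT NULL,
               lon         REAL NOT NULL
           )"""
    )
    return conn


def record_position(conn, vehicle_id, reported_at, lat, lon):
    """Input side: store one position report from a vehicle."""
    conn.execute(
        "INSERT INTO positions VALUES (?, ?, ?, ?)",
        (vehicle_id, reported_at, lat, lon),
    )
    conn.commit()


def latest_positions(conn):
    """Query side: the newest report per vehicle, e.g. to place map markers."""
    rows = conn.execute(
        """SELECT p.vehicle_id, p.reported_at, p.lat, p.lon
           FROM positions AS p
           JOIN (SELECT vehicle_id, MAX(reported_at) AS newest
                 FROM positions GROUP BY vehicle_id) AS m
             ON m.vehicle_id = p.vehicle_id AND m.newest = p.reported_at
           ORDER BY p.vehicle_id"""
    )
    return rows.fetchall()


store = open_store()
record_position(store, "truck-1", "2024-01-01T08:00:00Z", 35.90, 14.50)
record_position(store, "truck-1", "2024-01-01T08:05:00Z", 35.91, 14.51)
record_position(store, "truck-2", "2024-01-01T08:01:00Z", 35.88, 14.40)
current = latest_positions(store)
```

Either a browser front end or a desktop client can then call the same query side, which is why the presentation choice can be deferred.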
Don't forget that you can always host a web control in a thick client app. This is actually trivial with .Net on the Windows platform with the IE control. You can also access the browser's DOM this way and do some neat things. So just because there's a strong web component to what you're doing you're not necessarily "stuck" writing a pure web app.
One big question is what kind of hardware you'll be able to put in the vehicles. Will they be laptops or small PCs with full fledged OSs or something more mobile like CE or a pared-down Linux distro?
Google Maps is JavaScript based, so you can do most things with it, e.g. browser-based apps, widgets, etc. However, due to the licensing, Google won't allow you to use it in anything other than an Internet environment unless you use their Enterprise License.
In terms of integrating it into other systems, it's really difficult to say what's best without knowing what other software you are using, what protocols it uses, whether web services are available, etc. I agree with Daniel, though, in that any distributed system not implemented in a browser had better have some good reasons not to be, simply because the benefits are substantial. You'll need to weigh them up, though, with a full breakdown of all the different systems you will need to interact with, and work out what fits best.
The great thing is that with it being JavaScript based you have a lot of flexibility in what you can do with it.
This is more an extension to @Daniel Vassallo's answer. Although a web-based application would solve most problems, there may be the small potential issue of bandwidth usage and reception for internet access. This may or may not be an issue for the fleet management, depending on how that is tackled on the hardware side of things.
An offline solution may assist with this issue but then a clever architect could find a way to create an initial web based solution which can be accessed with an offline application which can pick up the slack and/or provide predictive reasoning until a connection is re-established.
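One simple way to "pick up the slack" is to buffer updates locally and flush them once the link is back. The sketch below is only that part (no predictive reasoning), and the `BufferedReporter` class, the `send` callable and the use of `ConnectionError` to signal a lost link are all assumptions made for illustration.

```python
from collections import deque


class BufferedReporter:
    """Queue updates while offline and deliver them in order when possible."""

    def __init__(self, send, max_buffered=1000):
        self._send = send
        # When the buffer is full, the oldest pending update is dropped.
        self._pending = deque(maxlen=max_buffered)

    @property
    def pending(self):
        return len(self._pending)

    def report(self, update):
        self._pending.append(update)
        return self.flush()

    def flush(self):
        """Try to send everything pending; stop at the first failure."""
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                return False      # still offline, keep the backlog
            self._pending.popleft()
        return True


# Simulated link that can be switched on and off.
link = {"up": False}
delivered = []


def fake_send(update):
    if not link["up"]:
        raise ConnectionError("no reception")
    delivered.append(update)


reporter = BufferedReporter(fake_send)
reporter.report("p1")
reporter.report("p2")
backlog_while_offline = reporter.pending
link["up"] = True
reporter.report("p3")
```

Once the link returns, the backlog is delivered in the original order before the new update.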