right tech choice to work with Json data? - json

We are a data team working with our data producers to store and process infrastructure log data. Workers running at our client systems generate log data which are primarily in json format.
There is no defined structure to the json data as it depends on multiple factors like # of clusters run by the client, tokens generated etc. There is some definite structure to the top-level json elements that contain the metadata where the logs are generated. Actual data can go into multiple levels of nesting and varying key-value pairs.
I want to build a sytem to ingest these logs, parse them and present in a way where engineers and PMs(prod managers) can read the data for analytics usecases.
My initial plan is to setup a compute layer like Kinesis to the source and write parsing logic to store the outcome in s3. However, this would need prior knowledge of the json file itself.
I define a parser module to process the data based on the log type. For every incoming log, my compute(kinesis?) directs data processing to corresponding parser module and emit data into s3.
However, I am starting to explore if any different storage engine(elastic etc.) will fit my usecase better. I am wondering if anyone as run into such usecases and what did you find helpful in solving the problem

Related

What's stopping me from using a standalone JSON file instead of a local db?

I need to store data for a native mobile app I'm writing and I was wondering: 'why do I need to bother with DB setup when I can just read/write a JSON file?. All the interactions are basic and could most likely be parsed as JSON objects rather than queried.
what are the advantages?
DB are intent to work with standardized data or large data sets. If you know that there is only a few properties to read and it's not changing, JSON may be easier, but if you have a list of items, a DB can optimize the queries with index or ensure consistency through multiple tables

Connect Adobe Analytics to MYSQL

I am trying to connect the data collected from Adobe Analytics to my local instance of MYSQL, is this possible? if so what would be the method of doing so?
There isn't a way to directly connect your mysql db with AA, make queries or whatever.
The following is just some top level info to point you in a general direction. Getting into specifics is way too long and involved to be an answer here. But below I will list some options you have for getting the data out of Adobe Analytics.
Which method is best largely depends on what data you're looking to get out of AA and what you're looking to do with it, within your local db. But in general, I listed them in order of level of difficulty of setting something up for it and doing something with the file(s) once received, to get them into your database.
First option is to within the AA interface, schedule data to be FTP'd to you on a regular basis. This can be a scheduled report from the report interface or from Data Warehouse, and can be delivered in a variety of formats but most commonly done as a CSV file. This will export data to you that has been processed by AA. Meaning, aggregated metrics, etc. Overall, this is pretty easy to setup and parse the exported CSV files. But there are a number of caveats/limitations about it. But it largely depends on what specifically you're aiming to do.
Second option is to make use of their API endpoint to make requests and receive response in JSON format. Can also receive it in XML format but I recommend not doing that. You will get similar data as above, but it's more on-demand than scheduled. This method requires a lot more effort on your end to actually get the data, but it gives you a lot more power/flexibility for getting the data on-demand, building interfaces (if relevant to you), etc. But it also comes with some caveats/limitations same as first option, since the data is already processed/aggregated.
Third option is to schedule Data Feed exports from the AA interface. This will send you CSV files with non-aggregated, mostly non-processed, raw hit data. This is about the closest you will get to the data sent to Adobe collection servers without Adobe doing anything to it, but it's not 100% like a server request log or something. Without knowing any details about what you are ultimately looking to do with the data, other than put it in a local db, at face value, this may be the option you want. Setting up the scheduled export is pretty easy, but parsing the received files can be a headache. You get files with raw data and a LOT of columns with a lot of values for various things, and then you have these other files that are lookup tables for both columns and values within them. It's a bit of a headache piecing it all together, but it's doable. The real issue is file sizes. These are raw hit data files and even a site with moderate traffic will generate files many gigabytes large, daily, and even hourly. So bandwidth, disk space, and your server processing power are things to consider if you attempt to go this route.

Process and Query big amount of large files in JSON Lines format

Which technology would be best to import large amount of large JSON Line format files (approx 2 GB per file).
I am thinking about Solr.
Once the data will be imported it will have to be query-able.
Which technology would you suggest to import and then query JSON line format data in a timely manner?
You can start prototyping with some scripting language you prefer, to read the lines, massage the format as needed to get valid Solr json and send it to Solr via HTTP. Would the faster to get going.
Longer term, SolrJ will allow you to get max perf (if you need to), as you can:
hit the leader replica in a Solrcloud environment directly
use multiple threads to ingest and send docs (you can also use multiple processes). Not that this is harder/impossible with all other technologies, but in some it is.
you have the full flexibility of using all SolrJ api

What is an appropriate tool for processing JSON Activity Streams for both BI and Hadoop?

I have a number of systems, most of which are capable of generating data using JSON Activity Streams[1] (or can be coerced into doing so), and I want to use this data for analytics.
I want to use both a traditional SQL datamart for OLAP use, and also to dump the raw JSON data into Hadoop for running batch mapreduce jobs.
I've been reading up on Kafka, Flume, Scribe, S4, Storm and a whole load of other tools but I'm still not sure which is best suited to the task at hand. These seem to be either focussed on logfile data, or real-time processing of the activity stream, whereas I guess I'm more interested in doing ETL on activity streams.
The type of setup I'm thinking of is where I provide a configuration for all the streams I'm interested in (URLs, params, credentials), and the tool periodically polls them, dumps the output in HDFS, and also has a hook for me to process and transform the JSON for insertion into the datamart.
Do any of the existing open-source tools fit this case particularly well?
(In terms of scale I expect a max of 30,000 users interacting with ~10 systems - not simultaneously - so not really "Big Data", but not trivial either.)
Thanks!
[1] http://activitystrea.ms/
You should check out streamsets.com
It's an open source tool (and free to use) built exactly for these kinds of use cases.
You can use the HTTP Client source and HDFS destination to achieve your main goals. If you decide you need to use Kafka or Flume as well, support for both are also built-in.
You can also do transformations in a variety of ways including writing python or javascript for more complex transformations (or your own stage in Java if you choose).
You can also check out logstash (elastic.co) and NiFi to see if one of those works better for you.
*Full disclosure, I'm an engineer at StreamSets.

How to solve batch processing in ESB?

We have a legacy system that produces files that each contains hundreds of messages (financial transactions). We need to transform these messages into another format and submit them (individually) to a target system. The question is:
Should ESB accept these files for processing directly, or should there be an adapter application between the legacy system and ESB that would split received files into individual messages and let the ESB to process the messages individually (instead of processing the whole file)?
In the first solution we expect two ESB flows. The first one would transform the file into a new format, split it into the messages, and store these messages into a temporary location. The transformation needs to process the file as a whole, because the file contains some common sections that are needed for transformation of the individual messages.
The second flow would take the individual transformed messages (each in a separate DB transaction), pass them to the target system, and wait for its answer (synchronously or asynchronously).
The second solution would replace the first flow by an external application that would transform the file, split it into individual transformed messages, and store them in a temporary location (local file system). The second flow would stay in the ESB.
In our eyes, the disadvantage of the first solution is in that the ESB would have to process huge files (in the first flow), which is commonly considered an antipattern. On the other hand, the ESB would adjust directly to the interface of the legacy system, which is one of the purposes of ESB.
In the second solution, the adapter application would contain the transformation logic, which should be another of the purposes and responsibilities of ESB.
What is the commonly suggested solution for this situation (a pattern)? Could you provide some references that are more descriptive than these two links that I've found?
http://publib.boulder.ibm.com/infocenter/esbsoa/wesbv7r5/index.jsp?topic=%2Fcom.ibm.websphere.wesb.programming.doc%2Ftopics%2Fesbprog_patterns.html
https://www.ibm.com/developerworks/wikis/display/esbpatterns/File+Processing
Edit
Another reference:
http://www.ibm.com/developerworks/webservices/library/ws-largemessaging/
Remember that there are 3 message types in SOA: Command, Event, Document
That 'Document' bit is for chunks of data. It is probably better suited to 'real' document types such as 'Order' or 'Invoice' and the like but there is nothing stopping you from going with 'TransactionBatch'.
That being said, it is a rather unused message type in that not many service buses actually implement anything around it, since:
you do not really need it
many message queuing technologies have limits on message size (as low as 4kb) making it difficult to transport any large message (needs to be sent in chunks)
So what I would do in your scenario is have an endpoint that processes the file. So something like a ProcessTransactionFileCommand sent to the processing endpoint and in it you only have a reference to the actual file (stored somewhere in the file system or even a url to download from). That processing endpoint can process the file and send the individual messages (all within a transaction) to the integration endpoint that sends the message off to the external system. You could have a SendTransactionCommand to do that.
In this way your system is very flexible in that the integration endpoint can receive individual integration commands form some parts of your solution while the processing endpoint can handle the batch and split them into individual integration commands.
Should you be in the .NET space you may want to look at my FOSS service bus project: http://shuttle.codeplex.com/
But any service bus will do the trick (MassTransit, NServiceBus, etc.)
You can use an ESB for the first case and I don't think it would be an anti-pattern. The purpose of an ESB it's also to integrate legacy applications, that create files as output as in your use case, with other applications.
You can try Mule ESB. It will allow you to consume the file using streaming (through the file transport), map the content of the file to your desired output using a GUI called DataMapper and finally put those messages en a VM queue which can be a persistent queue within the ESB. This queues are transactional so you can guarantee that all the messages created from one file were put on the VM queue or none of them.
Then you can, from another flow (in fact processed within the ESB are called flows in mule) read each of those messages and process them.
HTH, Pablo.