harvesting data from ckan instance - data-analysis

I am harvesting data from a CKAN instance and collecting it in another CKAN instance. I am totally new to this subject. I am able to harvest datasets into my CKAN instance, but I do not understand the next step: how do I do pre-treatment and post-treatment of the data? Here is a screenshot of my current collection of datasets.
Please suggest how I should go about it.

You would have to write your own "harvester", probably taking the CKAN harvester as a basis and changing it to fit your needs.
Instructions here:
https://github.com/ckan/ckanext-harvest#the-harvesting-interface
The import_stage is where you would add your pre- and post-processing.
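For example, here is a minimal sketch of a custom harvester whose import_stage cleans each dataset before it is imported. The module path and class name follow ckanext-harvest's bundled CKAN harvester, but check them against your installed version; the tag and resource clean-up rules are purely illustrative.

    import json

    from ckanext.harvest.harvesters.ckanharvester import CKANHarvester


    class CleaningCKANHarvester(CKANHarvester):
        """Harvests a remote CKAN instance and normalises datasets on import."""

        def info(self):
            return {
                'name': 'ckan_cleaning',
                'title': 'CKAN (with clean-up)',
                'description': 'Harvests remote CKAN datasets and normalises them',
            }

        def import_stage(self, harvest_object):
            # harvest_object.content holds the remote dataset as a JSON string
            package_dict = json.loads(harvest_object.content)

            # Illustrative "pre-treatment": normalise tags, drop resources without a URL
            package_dict['tags'] = [
                {'name': tag['name'].strip().lower()}
                for tag in package_dict.get('tags', [])
            ]
            package_dict['resources'] = [
                res for res in package_dict.get('resources', []) if res.get('url')
            ]

            # Hand the cleaned dict back and let the parent class finish the import
            harvest_object.content = json.dumps(package_dict)
            return super(CleaningCKANHarvester, self).import_stage(harvest_object)

The new class is registered as a CKAN plugin (via your extension's setup.py entry points), after which it appears as a source type when you create a harvest source.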

Related

importing Oozie workflows into Cloudera Hue with shared groups

My team does a lot of work staging data from other sources. I've worked to automate/script this process as much as possible because some of these sources have hundreds of tables to ingest. My pain point is in sharing my scripted Oozie workflows.
I am able to generate JSON files which are imported into Hue, but the JSON files do not specify what users/groups can read or modify them. This means I would have to manually go through every workflow and share them. I may be wrong, but it appears that sharing information is stored in the backend of Hue, not the workflows themselves.
Is it possible to add a key-value pair inside the JSON import file to specify the groups to share with?
or
Am I missing an easier way to bulk share?

Developing a web-based real-time dashboard

I want to consume a Kafka stream into MySQL using Python; on top of that I want to build a real-time, web-based dashboard (web app) that refreshes automatically (via AJAX) on each insert into the database.
After some searching, I found a suggestion that AJAX is not good for this purpose.
This post said WebSockets are better than AJAX in terms of performance.
Because I am not sure what the best way to achieve this is, your expert advice is needed.
Thanks.
"I want to consume kafka stream into mysql using Python; on top if which I want to build a realtime web based (web app) dashboard that will automatically be refreshed (ajax) on each data insert in the database."
... (wince!) ...
Pretty-please find someone among your peers who can save you from yourself. (And is there any possible way I can say this to you such that you can "save face"? I can't think of one.) I'm not at all trying to have public fun at your expense. Please talk to your manager immediately; he or she can surely help you.
I am certainly not an expert in this field, but my company uses Elastic/Kibana to read from Kafka topics and display the data on a dashboard. This is just one of many routes you can take, but it works very well for us. You can read a little more about it here:
https://www.elastic.co/blog/just-enough-kafka-for-the-elastic-stack-part1
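If you do go the Python-into-MySQL route, a minimal sketch of the ingestion side is below. The topic, table, column names and credentials are made up, and it assumes the kafka-python and PyMySQL packages.

    import json

    import pymysql
    from kafka import KafkaConsumer

    # Consume the hypothetical 'events' topic, decoding each message as JSON
    consumer = KafkaConsumer(
        'events',
        bootstrap_servers='localhost:9092',
        value_deserializer=lambda v: json.loads(v.decode('utf-8')),
    )

    conn = pymysql.connect(host='localhost', user='dash', password='secret',
                           database='dashboard')

    for message in consumer:
        event = message.value
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO events (ts, metric, value) VALUES (%s, %s, %s)",
                (event['ts'], event['metric'], event['value']),
            )
        conn.commit()
        # This is where a WebSocket push would hang off: instead of the browser
        # polling via AJAX, the server notifies every connected dashboard as
        # soon as a row is inserted.

Whether you notify the browser with WebSockets, Server-Sent Events or a library such as Socket.IO is a separate choice; the point is that the server pushes on insert rather than the page polling.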

Hosting JSON Files for Mobile Application

I am creating a mobile application in Swift for my organization. The application reads data in JSON format to populate the information that gets displayed in the app. I already have a method to generate the JSON files, but I need somewhere to host the actual files. I have an AWS account and an instance running, which is where I initially hosted the JSON files, but I got an email from AWS saying that having the app constantly fetch the JSON files stored there resembled scanning behaviour, which apparently is not allowed. So I was wondering where I could host JSON files so that my mobile app can read the information it needs. The biggest requirement is a static URL that my app can keep calling.
I was thinking of putting the files in an AWS bucket with read permissions and having the app access them there, but since AWS already complained about me doing something like that, I'm iffy. I was also thinking of putting the JSON files on GitHub, but again, I'd hate to get an email from GitHub telling me they don't like an application repeatedly grabbing the data.
For background, the app essentially has a hardcoded URL from which it grabs the JSON data and parses it. I didn't build an API because an API takes some time to fetch information that doesn't really change very often; it's much easier to generate the JSON files locally and just post them online somewhere. The information can be read by anyone, too; it's not private or anything.
Message from AWS:
Hello,
We've received a report(s) that your AWS resource(s)
information
has been implicated in activity which resembles scanning remote hosts on the internet for security vulnerabilities. Activity of this nature is forbidden in the AWS Acceptable Use Policy (https://aws.amazon.com/aup/). We've included the original report below for your review.
Please take action to stop the reported activity and reply directly to this email with details of the corrective actions you have taken. If you do not consider the activity described in these reports to be abusive, please reply to this email with details of your use case.
If you're unaware of this activity, it's possible that your environment has been compromised by an external attacker, or a vulnerability is allowing your machine to be used in a way that it was not intended.
We are unable to assist you with troubleshooting or technical inquiries. However, for guidance on securing your instance, we recommend reviewing the following resources:
I'm new so it won't let me post links but they attached a couple help links
If you require further assistance with this matter, you can take advantage of our developer forums:
more links I can't have
Or, if you are subscribed to a Premium Support package, you may reach out for one-on-one assistance here:
link
Please remember that you are responsible for ensuring that your instances and all applications are properly secured. If you require any further information to assist you in identifying or rectifying this issue, please let us know in a direct reply to this message.
Regards,
AWS Abuse
Abuse Case Number:
Using an AWS EC2 instance to host static files (which is what it sounds like you were doing?) is pretty standard, and I suspect that this is not what Amazon is complaining about. More likely, your instance has been infected by some sort of software that is causing it to request many files from other random servers on the web ("scanning remote hosts for security vulnerabilities"). You should check that you have not accidentally posted your AWS credentials publicly (in any form), and consider wiping the instance and rebuilding it. And of course, reply to the email explaining this to AWS.
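On the S3 idea specifically: serving the generated JSON from a bucket (optionally fronted by CloudFront) gives you exactly the stable URL you describe, and normal GET traffic to your own bucket is not scanning. A minimal sketch with boto3, where the bucket name, key and payload are placeholders and the bucket is assumed to allow public reads via a bucket policy:

    import json

    import boto3

    s3 = boto3.client('s3')

    # Hypothetical payload produced by your existing JSON-generation step
    payload = {"version": 3, "items": [{"id": 1, "title": "Example"}]}

    s3.put_object(
        Bucket='my-org-app-data',            # placeholder bucket name
        Key='feeds/app-data.json',
        Body=json.dumps(payload).encode('utf-8'),
        ContentType='application/json',
        CacheControl='max-age=300',          # let clients cache briefly between refreshes
    )

    # The app then keeps requesting the same static URL, e.g.
    # https://my-org-app-data.s3.amazonaws.com/feeds/app-data.json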

Dynamically changing Report's Shared Data Source at Runtime

I'm looking to use SSRS for multi-tenant reporting, and I'd like the ability to choose a report's Shared Data Source at runtime. What do I mean by this? Well, I could be flexible, but I think the two most likely possibilities are as follows (however, I'm also open to other possibilities):
The Shared Data Source is dictated by the client's authentication. In my case, the "client" is a .NET application and not the user, so if this is a viable path then I'd like to somehow have the MainDB (that's what I'm calling it) Shared Data Source selected by the Service Account that the client logs in as.
Pass the name of the Shared Data Source as a parameter and let that dictate which one to use. Given that all of my clients are "trusted players", I am comfortable with this approach. While each client will have its own representative Service Account, it's just for good measure and should not be important. So instead of just calling the data source MainDB, we could instead have Client1DB and Client2DB, etc. It's okay if a new data source means a new deployment but I need this to scale easily enough as well to ~50 different data sources over time.
Why? Because we have multiple/duplicate copies of our production application for multiple customers, but we don't want to duplicate everything, just the web apps and databases. We're fine with some common "back-end" things. And for SSRS, because of how expensive licenses are (and how rarely reports are run by our users), we really want to have just a single back-end for all of our customers (I actually have a second one on standby for manual disaster-recovery situations - we don't need to be too fancy here, as reports are the least important DR concern we have).
I have seen this question which points to this post but I was really hoping there was a better way than this. Because of all of those additional steps/efforts/limitations/etc, I'd rather just use PowerShell to script duplicate deployments of the reports with tweaked hardcoded data sources instead of standardizing on the steps in that post. That solution feels WAY too hacky to me and doesn't seem to scale very well at all.
I've done this a bunch of terrible ways (usually hardcoded in a dynamic script), and then I discovered it's actually quite simple.
Instead of using a Shared Data Source, use an embedded connection and build the connection string from an expression based on parameters (or any string-manipulation code).
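For example, a minimal sketch of an expression-based connection string for an embedded data source, assuming a report parameter (here called ClientDB, a made-up name) that carries the tenant's database name:

    ="Data Source=MyDbServer;Initial Catalog=" & Parameters!ClientDB.Value

The calling .NET application supplies the parameter value, so each tenant's report run points at its own database while the report definition stays single-sourced.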

Data warehouse: fetch data directly from the DB or through an API?

We need to pull data into our data warehouse. One of the data sources is internal.
We have two options:
1. Ask the data source team to expose the data through an API.
2. Ask the data source team to dump the data on a daily basis and grant us read-only DB credentials to access the dump.
Could anyone please give some suggestions?
Thanks a lot!
A lot of this really depends on the size and nature of the data, what kind of tools you are using, whether the data source team has experience building APIs, and so on.
I think we need a lot more information to make an informed recommendation here. I'd really recommend you have a conversation with your DBAs, see what options they have available to you, and seriously consider their recommendations. They probably have far more insight than we will concerning what would be most effective for your problem.
Cons of the API solution:
Cost. Your data source team will have to build the API. Then you will have to build a client application to read the data from the API and insert it into the DB. You will also have to host the API somewhere and design the deployment process. It's quite a lot of work, and I don't think it's worth it.
Performance. Not necessarily a problem, but data warehouses usually mean dealing with a lot of data, and with an API you will most likely have to transform the data before you can use the bulk-insert features of your database.
The daily DB dump solution looks much better to me, but I would change it slightly if I were you: I would use a flat file. Most databases can bulk-insert data from a file, and that usually turns out to be the fastest way to accomplish the task.
So from what I can tell from your question, I think you should do the following:
1. Agree with your data source team on the data file format. This way you can work independently and even use different RDBMSs.
2. Choose a shared location that both the data source team's DB and your DB can access quickly.
3. Ask the data source team to implement the export logic to a file.
4. Implement the import logic from a file.
Please note that items 3 and 4 should take only a few lines of code. As I said, most databases have built-in and OPTIMIZED functionality to export/import data to and from a file.
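As a minimal sketch of item 4, assuming a MySQL-based warehouse and a CSV file dropped on the agreed share (the table, column and path names are placeholders, and PyMySQL needs local_infile enabled):

    import pymysql

    conn = pymysql.connect(
        host='warehouse-db', user='etl', password='secret',
        database='dwh', local_infile=True,
    )

    with conn.cursor() as cur:
        # Bulk-load the daily dump in one statement instead of row-by-row inserts
        cur.execute(r"""
            LOAD DATA LOCAL INFILE '/mnt/share/orders_daily.csv'
            INTO TABLE staging_orders
            FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
            LINES TERMINATED BY '\n'
            IGNORE 1 LINES
        """)
    conn.commit()

The export side is the mirror image (for example SELECT ... INTO OUTFILE on MySQL, or the equivalent bulk-export feature of whatever RDBMS the source team runs).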
Hope it helps!