How to read subdirectory data into a Big SQL table?

Big SQL is not able to read data from subdirectories the way Hive can when the following parameters are set:
set tblproperties (
"hive.input.dir.recursive" = "TRUE",
"hive.mapred.supports.subdirectories" = "TRUE",
"hive.supports.subdirectories" = "TRUE",
"mapred.input.dir.recursive" = "TRUE")
I tried adding the above parameters to the Big SQL table properties, but it still cannot read the subdirectory data.
What parameters do I need to set in Big SQL to read subdirectory data?

As far as I understand, Big SQL only looks for files in the table's top-level (parent) directory; it does not check sub-directories at all. The table will appear empty because the data cannot be read recursively. This capability is still an open product improvement idea with the IBM product and engineering team.
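Until that lands, one possible workaround (my own suggestion, not something Big SQL provides, and it only applies if you control the HDFS layout; the paths below are placeholders) is to flatten the files up into the table's top-level directory so Big SQL can see them:
# See what is hidden in subdirectories that Big SQL will not read
hdfs dfs -ls -R /path/to/table_dir
# Move (or copy) the files from each subdirectory into the table's top-level directory
hdfs dfs -mv /path/to/table_dir/subdir1/* /path/to/table_dir/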

Related

Is there a way to insert the filename into the record when importing a JSON file into Power BI?

Not sure how to ask this, but here goes. I have a collection of 500+ JSON files that I need to import into Power BI. Each JSON has four different levels of information that I need to parse out. I converted the JSON top-level info into a table and transposed it so I had one row, as in the screenshot I attached.
My first question is: can I easily add the filename to the JSON record? I would like to use the filename as a unique key in later queries.
Thanks!
It looks like you may be connecting to each JSON file individually? If so, and assuming all the JSON files can live in a single folder, you can use the "Folder" connector. That then allows you to right-click the original folder query, choose "Reference", and build the transformations for each JSON file from there, and it includes the file name.
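For illustration, a minimal Power Query (M) sketch of that approach could look roughly like this; the folder path is just a placeholder, and you would expand the parsed JSON column further to reach your four levels:
let
    Source = Folder.Files("C:\JsonFiles"),
    // Parse each file's binary content as JSON, keeping the file name alongside it
    Parsed = Table.AddColumn(Source, "Json", each Json.Document([Content])),
    Result = Table.SelectColumns(Parsed, {"Name", "Json"})
in
    Result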
Related details:
https://powerbi.tips/2016/06/loading-data-from-folder/
https://learn.microsoft.com/en-us/power-bi/guidance/power-query-referenced-queries
Hoping that helps!

Import json files from s3 into postgres RDS

I want to write a script (maybe a Lambda?) so that every new JSON file uploaded to this S3 bucket is also loaded directly into a Postgres table in PostgreSQL RDS.
The JSON is nested and contains lists of JSON objects inside, so it is not that simple to just parse it in Postgres. In addition, it has a changing number of columns, so a new file may add a new column to the table. (If a file has a column that hasn't appeared yet, I want to add it and leave NULL in that column for the existing rows.)
How can I do it efficiently?
As suggested, you can write a Lambda that listens to S3 events, so a function is triggered whenever a new file is uploaded.
https://n2ws.com/blog/aws-automation/lambda-function-s3-event-triggers
Once the event is triggered, you need to read and parse the file.
Then connect to the database and run the SQL statements you generate from the parsed object.
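For illustration, a minimal Lambda sketch along those lines might look like the code below. The table name raw_events, the environment variables, the assumption that the table already exists, and packaging psycopg2 as a Lambda layer are all my own choices, not part of the answer; nested values and all other values are simply stored as text to keep the sketch short.
import json
import os

import boto3
import psycopg2

s3 = boto3.client("s3")

def handler(event, context):
    # S3 put event -> bucket and key of the newly uploaded file
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    # Read and parse the new JSON file
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    doc = json.loads(body)

    # Flatten one level: nested dicts/lists become JSON text, everything else becomes a string
    row = {
        k: (json.dumps(v) if isinstance(v, (dict, list)) else (None if v is None else str(v)))
        for k, v in doc.items()
    }

    conn = psycopg2.connect(
        host=os.environ["PGHOST"],
        dbname=os.environ["PGDATABASE"],
        user=os.environ["PGUSER"],
        password=os.environ["PGPASSWORD"],
    )
    with conn, conn.cursor() as cur:
        # Add any columns this file introduces; existing rows keep NULL for them
        cur.execute(
            "SELECT column_name FROM information_schema.columns WHERE table_name = %s",
            ("raw_events",),
        )
        existing = {r[0].lower() for r in cur.fetchall()}
        for col in row:
            if col.lower() not in existing:
                # Column names come straight from the file, so sanitize them in real code
                cur.execute('ALTER TABLE raw_events ADD COLUMN "{}" text'.format(col))

        cols = ", ".join('"{}"'.format(c) for c in row)
        placeholders = ", ".join(["%s"] * len(row))
        cur.execute(
            "INSERT INTO raw_events ({}) VALUES ({})".format(cols, placeholders),
            list(row.values()),
        )
    conn.close()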

Extracting array properties from Cosmos DB documents using Azure Data Factory

I have an Azure Data Factory v2 pipeline that's pulling data from a Cosmos DB collection. This collection has a property that's an array.
I want to, at the very least, be able to dump that entire property's value into a column in SQL Azure. I don't need it parsed (although that would be great too), but ADF lists this column as "Unsupported Type" in the dataset definition and puts it in the Excluded Columns section.
Here is an example of the JSON I'm working with. The property I want is "MyArrayProperty":
{
  "id": "c4e2012e-af82-4c48-8960-11e0436e6d3f",
  "Created": "2019-06-14T16:04:13.9572567Z",
  "Updated": "2019-06-14T16:04:14.1920988Z",
  "IsActive": true,
  "MyArrayProperty": [
    {
      "SomeId": "a4427015-ca69-4958-90d3-0918fd5dcac1",
      "SomeName": "BlahBlah"
    }
  ]
}
I've tried manually specifying a column in the ADF data source like "MyArrayProperty" and using a string data type, but the value always comes across as null.
Please check this document about schema mapping between MongoDB and Azure SQL. Basically, you should define a collectionReference that iterates through your nested array of objects and does a cross apply.
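For illustration, the copy activity mapping might look roughly like the sketch below. It is pieced together from the sample document above rather than taken from the original answer, so double-check it against the schema mapping documentation:
"translator": {
    "type": "TabularTranslator",
    "mappings": [
        { "source": { "path": "$['id']" }, "sink": { "name": "Id" } },
        { "source": { "path": "['SomeId']" }, "sink": { "name": "SomeId" } },
        { "source": { "path": "['SomeName']" }, "sink": { "name": "SomeName" } }
    ],
    "collectionReference": "$['MyArrayProperty']"
}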
There may be a better way to solve this problem, but I ended up creating a second copy activity that uses a query against Cosmos rather than a collection-based capture. The query flattens the array like so:
SELECT m.id, c.SomeId, c.SomeName
FROM myCollection m join c in m.MyArrayProperty
I then took this data set, dumped it into a SQL table, and did the rest of my work inside SQL Azure itself. You could also use the new Join pipeline task to do this in memory before it reaches the destination.

Mass Upload Files To Specific Contacts Salesforce

I need to upload some 2000 documents to specific users in Salesforce. I have a CSV file that has the Salesforce-assigned ContactID, as well as a direct path to the files on my desktop. Each contact's specific file URL has been included in the CSV. How can I upload them all at once and, especially, to the correct contact?
You indicated in the comments / chat that you want it as "Files".
The "Files" object is bit more complex than Attachments, you'll need to do it in 2-3 steps. What you see as a File (you might see it referred to in documentation as Chatter Files or Salesforce Content) is actually several tables. There's
ContentDocument which can be kind of a file header (title, description, language, tags, linkage to many other areas in SF - because it can be standalone, it can be uploaded to certain SF Content Library, it can be linked to Accounts, Contacts, $_GOD knows what else)
ContentVersion which is well, actual payload. Only most recent version is displayed out of the box but if you really want you can go back in time
and more
The crap part is that you can't insert ContentDocument directly (there's no create() call in the list of operations) .
Theory
So you'll need:
Insert ContentVersion records (v1 will automatically create the parent ContentDocuments for you... it does sound a bit ass-backwards but it works). After this is done you'll have a bunch of standalone documents loaded, but not linked to any Contacts
Learn the Ids of their parent ContentDocuments
Insert ContentDocumentLink records that will connect Contacts and their PDFs
Practice
This is my C:\stacktest folder. It contains some SF cheat sheet PDFs.
Here's my file for part 1 of the load
Title,PathOnClient,VersionData
"Lightning Components CheatSheet","C:\stacktest\SF_LightningComponents_cheatsheet_web.pdf","C:\stacktest\SF_LightningComponents_cheatsheet_web.pdf"
"Process Automation CheatSheet","C:\stacktest\SF_Process_Automation_cheatsheet_web.pdf","C:\stacktest\SF_Process_Automation_cheatsheet_web.pdf"
"Admin CheatSheet","C:\stacktest\SF_S1-Admin_cheatsheet_web.pdf","C:\stacktest\SF_S1-Admin_cheatsheet_web.pdf"
"S1 CheatSheet","C:\stacktest\SF_S1-Developer_cheatsheet_web.pdf","C:\stacktest\SF_S1-Developer_cheatsheet_web.pdf"
Fire up Data Loader, select Insert, tick "Show all Salesforce objects" and find ContentVersion. The load should be straightforward (if you're hitting memory issues, set the batch size to something low, even 1 record at a time if really needed).
You'll get back a "success file", but it's useless here: we don't need the Ids of the generated ContentVersions, we need their parents. Fire "Export" in Data Loader, show all objects again, pick ContentDocument, and use a query similar to this:
Select Id, Title, FileType, FileExtension
FROM ContentDocument
WHERE CreatedDate = TODAY AND CreatedBy.FirstName = 'Ethan'
You should see something like this:
"ID","TITLE","FILETYPE","FILEEXTENSION"
"0690g0000048G2MAAU","Lightning Components CheatSheet","PDF","pdf"
"0690g0000048G2NAAU","Process Automation CheatSheet","PDF","pdf"
"0690g0000048G2OAAU","Admin CheatSheet","PDF","pdf"
"0690g0000048G2PAAU","S1 CheatSheet","PDF","pdf"
Use Excel and the magic of VLOOKUP (or something like it) to link them back to Contacts by title. You wrote that you already have a file with Contact Ids and titles, so there's hope... Create a file like this:
ContentDocumentId,LinkedEntityId,ShareType,Visibility
0690g0000048G2MAAU,0037000000TWREI,V,InternalUsers
0690g0000048G2NAAU,0030g000027rQ3z,V,InternalUsers
0690g0000048G2OAAU,0030g000027rQ3a,V,InternalUsers
0690g0000048G2PAAU,0030g000027rPz4,V,InternalUsers
The 1st column is the file Id, then the Contact Id, then some black magic you can read about (and change if needed) in the ContentDocumentLink docs.
Load it as an Insert into (again, show all objects) ContentDocumentLink.
Woohoo! Beer time.
Your CSV should contain the following fields (a hypothetical sample row is sketched below):
- ParentId = Id of the object you want to link the attachment to (the Id of the Contact)
- Name = name of the file
- ContentType = content type of the file (.xls, .pdf, ...)
- OwnerId = if empty, I believe it takes your user as the owner
- Body = the location of the file on your machine (for instance: C:\SFDC\Files\test.pdf)
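A hypothetical example row following that layout (the Contact Id is reused from the answer above, and ContentType is given here as a MIME type):
ParentId,Name,ContentType,OwnerId,Body
0037000000TWREI,test.pdf,application/pdf,,C:\SFDC\Files\test.pdf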
Use this CSV to insert the records (via Data Loader) into the Attachment object.
You will then see, for each contact, that the records have been added to the 'Notes & Attachments' related list.

processing multiple files in business objects data services

I am new to Business Objects Data Services.
I have to run a dataflow that reads from a file. The filename should be resolved using wildcards (names starting with Platform). I want to run the dataflow only if the file exists; if the file is not present, it should not error out or do anything else, it should just move on to the next dataflow or workflow in the job.
I tried the code below to check whether the file exists, since the built-in function file_exists() cannot check for a file using wildcards.
$FILEEXISTSFLAG = exec('/bin/ksh', '-c "ls xxxxxx/Platform.csv"', 8);
My intention is that, based on the value assigned to $FILEEXISTSFLAG by the code above, I will decide whether to execute the dataflow (if $FILEEXISTSFLAG is null, do nothing; otherwise execute the dataflow), but it returns the output below.
ls: cannot access /xxxxxx/Platform.csv: No such file
Is there any other way to achieve this?
I was able to solve the above problem by using the index function.
$FILEEXISTSFLAG contains a value like "ls: cannot access Platform: No such file or directory". So I used the index function below: if it returns null (the error text is not found, meaning the file exists), the dataflow is executed; otherwise nothing is done.
index($FILEEXISTSFLAG, 'No such file', 1)
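Put together, a rough sketch of the wiring is shown below: a script object followed by a Conditional in the workflow. The path and wildcard pattern are placeholders, and this is my reading of the approach rather than the original poster's exact setup.
# Script object: capture the ls output (a file listing if the file exists, the error text if it does not)
$FILEEXISTSFLAG = exec('/bin/ksh', '-c "ls xxxxxx/Platform*.csv"', 8);

# Condition used in the Conditional object: run the dataflow only when the error text is absent
index($FILEEXISTSFLAG, 'No such file', 1) IS NULL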
Thanks,
Phani.