Bit of a newb when it comes to this, but I have around 15,000 HTML files with XBRL data in them.
I've downloaded these files from http://download.companieshouse.gov.uk/en_monthlyaccountsdata.html
Ideally I want to extract from all of these files information related to the company's name and intangible assets but I'm unsure how to do this.
I'd want to export the data into columns in a single Excel file.
Any help would be appreciated.
A bit late to answer, but never mind. As a start, you could have a look at VT Fact Viewer. It can give you a grid display of the XBRL facts in the document and you can export them to Excel. Once there you'll need to do some filtering looking for tags like "core:IntangibleAssets" or maybe "uk-gaap:Intangible...." sort of things.
However, if you're doing this on a lot of documents (such as the CH data dump) then you're going to need to start doing some "proper" xml processing of your own using a programming or scripting language. But, the viewer will still be helpful as it will show you the sort of things you are aiming to extract.
As a simple example the following will get you some Intangible asset data in CSV format which you can open in Excel.
Written in C# (using LINQPad) so you'll have to translate if required:
// LINQPad imports System.IO, System.Linq, System.Collections.Generic and System.Xml.Linq by default;
// add the matching using directives if you run this elsewhere.
string fname = @"C:\ch_data\Prod223_1770_00101234_20160331.html";
var doc = XDocument.Load(fname);
// The 'ix' namespace may use 2008 or 2013 schema so we'll just use the .LocalName property of the tag
var elements = doc.Root
    .Descendants()
    .Where(x => x.Name.LocalName == "nonFraction")
    .Where(x => x.Attributes().Any(a => a.Value.Contains("Intangible")));
var lines = new List<string>();
foreach (var element in elements)
{
    var attribs = element.Attributes();
    var ctx = attribs.FirstOrDefault(a => a.Name == "contextRef")?.Value ?? "";
    var dec = attribs.FirstOrDefault(a => a.Name == "decimals")?.Value ?? "";
    var scale = attribs.FirstOrDefault(a => a.Name == "scale")?.Value ?? "";
    var units = attribs.FirstOrDefault(a => a.Name == "unitRef")?.Value ?? "";
    var fmt = attribs.FirstOrDefault(a => a.Name == "format")?.Value ?? "";
    var name = attribs.FirstOrDefault(a => a.Name == "name")?.Value ?? "";
    var value = element.Value;
    // One quoted CSV line per matching fact
    string line = $"\"{ctx}\",\"{dec}\",\"{scale}\",\"{units}\",\"{name}\",\"{fmt}\",\"{value}\"";
    lines.Add(line);
    //Console.WriteLine(line);
}
// Writes the CSV next to the input file, with the same base name
File.WriteAllLines(Path.ChangeExtension(fname, "csv"), lines);
Change the input filename to loop through a directory or list of filenames as appropriate.
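One thing to watch with the line-building step above: a value that itself contains a double quote would break the CSV. As a rough sketch of the usual fix (double any embedded quotes), here it is in JavaScript rather than C#. The names csvField and csvLine are made up for this example, and the sample values are invented, not taken from a real filing.

```javascript
// Quote one value for CSV: wrap it in double quotes and double any embedded quotes.
function csvField(value) {
  const text = value === null || value === undefined ? "" : String(value);
  return '"' + text.replace(/"/g, '""') + '"';
}

// Join the values of one record into a single CSV line.
function csvLine(values) {
  return values.map(csvField).join(",");
}

// Invented sample record in the same column order as the C# example
// (context, decimals, scale, units, name, format, value).
console.log(csvLine(["ctx1", "0", "", "GBP", "core:IntangibleAssets", "", "12,500"]));
```

The same idea carries over to the C# version by replacing each double quote in a value with two before interpolating it into the line.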
I am creating the HTML template in Salesforce and below is the value I am fetching from one field that is building address. However, I just need the building number from the entire address. Below is the scenario.
“Václavské náměstí 785/28, P1 - Alfa Building” is the text, and I want to extract only the number, i.e. 785/28. The number of digits before and after the ‘/’ varies, though; it can be two, three, or more. Trim Left and Right work, but I can't seem to specify the values dynamically.
Thanks
Here is some code for this. It is plain JavaScript.
<script>
function myFunction() {
  var str = "sdfdfv123456/789xvxcv"; // pass your string over here
  var revstr = reverseString(str);
  var firstDigit = str.match(/\d/);
  var lastDigit = revstr.match(/\d/);
  var findex = str.indexOf(firstDigit);
  var lindex = str.length - revstr.indexOf(lastDigit);

  function reverseString(str) {
    if (str === "")
      return "";
    else
      return reverseString(str.substr(1)) + str.charAt(0);
  }

  var sub_string = str.substring(findex, lindex);
  alert(sub_string);
}
</script>
If anything is unclear, please let me know.
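Note that taking everything from the first digit to the last digit would return "785/28, P1" for the address in the question, because "P1" also contains a digit. A minimal alternative sketch, assuming the building number is always the first "digits/digits" group in the text; extractBuildingNumber is a name made up for this example.

```javascript
// Return the first "digits/digits" group in the text, or "" if there is none.
// Assumes the building number always has the form <digits>/<digits>.
function extractBuildingNumber(address) {
  const match = address.match(/\d+\/\d+/);
  return match ? match[0] : "";
}

console.log(extractBuildingNumber("Václavské náměstí 785/28, P1 - Alfa Building"));
```

Because \d+ matches one or more digits, the count of digits on either side of the slash does not need to be known in advance.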
I'm looking for some help. I am trying to grab an author's publications from PubMed and populate the data into Google Sheets using Apps Script. I've gotten as far as the code below and am now stuck.
Basically, what I have done was first pull all the Pubmed IDs from a particular author whose name comes from the name of the sheet. Then I have tried creating a loop to go through each Pubmed ID JSON summary and pull each field I want. I have been able to pull the pub date. I had set it up with the idea that I would do a loop for each field of that PMID I want, store it in an array, and then return it to my sheet. However, I'm now stuck trying to get the second field - title - and all the subsequent fields (e.g. authors, last author, first author, etc.)
Any help would be greatly appreciated.
function IMPORTPMID(){
var ss = SpreadsheetApp.getActiveSpreadsheet();
var sheet = ss.getSheets()[0];
var author = sheet.getSheetName();
var url = ("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=" + author + "[author]&retmode=json&retmax=1000");
var response = UrlFetchApp.fetch(url);
var AllAuthorPMID = JSON.parse(response.getContentText());
var xpath = "esearchresult/idlist";
var patharray = xpath.split("/");
for (var i = 0; i < patharray.length; i++) {
AllAuthorPMID = AllAuthorPMID[patharray[i]];
}
var PMID = AllAuthorPMID;
var PDparsearray = [PMID.length];
var titleparsearray = [PMID.length];
for (var x = 0; x < PMID.length; x++) {
var urlsum = ("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&retmode=json&rettype=abstract&id=" + PMID[x]);
var ressum = UrlFetchApp.fetch(urlsum);
var contentsum = ressum.getContentText();
var jsonsum = JSON.parse(contentsum);
var PDpath = "result/" + PMID[x] + "/pubdate";
var titlepath = "result/" + PMID[x] + "/title";
var PDpatharray = PDpath.split("/");
var titlepatharray = titlepath.split("/");
for (var j = 0; j < PDpatharray.length; j++) {
var jsonsum = jsonsum[PDpatharray[j]];
}
PDparsearray[x] = jsonsum;
}
var tempArr = [];
for (var obj in AllAuthorPMID) {
tempArr.push([obj, AllAuthorPMID[obj], PDparsearray[obj]]);
}
return tempArr;
}
From a PubMed JSON response for a given PubMed ID, you should be able to determine the fieldnames (and paths to them) that you want to include in your summary report. Reading them all is simpler to implement if they are all at the same level, but if some are properties of a sub-field, you can still access them if you give the right path in your setup.
Consider the "source JSON":
[
{ "pubMedId": "1234",
"name": "Jay Sahn",
"publications": [
{ "pubId": "abcd",
"issn": "A1B2C3",
"title": "Dynamic JSON Parsing: A Journey into Madness",
"authors": [
{ "pubMedId": "1234" },
{ "pubMedId": "2345" }
]
},
{ "pubId": "efgh",
...
},
...
],
...
},
...
]
The pubId and issn fields would be at the same level, while the publications and authors would not.
You can retrieve both the pubMedId and publications fields (and others you desire) in the same loop by either 1) hard-coding the field access, or 2) writing code that parses a field path and supplying field paths.
Option 1 is likely to be faster, but much less flexible if you suddenly want to get a new field, since you have to remember how to write the code to access that field, along with where to insert it, etc. God save you if the API changes.
Option 2 is harder to get right, but once right, will (should) work for any field you (properly) specify. Getting a new field is as easy as writing the path to it in the relevant config variable. There are possibly libraries that will do this for you.
To convert the above into spreadsheet rows (one per pubMedId in the outer array, e.g. the IDs you queried their API for), consider this example code:
function foo() {
const sheet = /* get a sheet reference somehow */;
const resp = UrlFetchApp.fetch(...).getContentText();
const data = JSON.parse(resp);
// paths relative to the outermost field, which for the imaginary source is an array of "author" objects
const fields = ['pubMedId', 'name', 'publications/pubId', 'publications/title', 'publications/authors/pubMedId'];
const output = data.map(function (author) {
var row = fields.map(function (f) {
var desiredField = f.split('/').reduce(delve_, author);
return JSON.stringify(desiredField);
});
return row;
});
sheet.getRange(1, 1, output.length, output[0].length).setValues(output);
}
function delve_(parentObj, property, i, fullPath) {
// Dive into the given object to get the path. If the parent is an array, access its elements.
if (parentObj === undefined)
return;
// Simple case: parentObj is an Object, and property exists.
const child = parentObj[property];
if (child)
return child;
// Not a direct property / index, so perhaps a property on an object in an Array.
if (parentObj.constructor === Array)
return collate_(parentObj, fullPath.splice(i));
console.warn({message: "Unhandled case / missing property",
args: {parent: parentObj, prop: property, index: i, pathArray: fullPath}});
return; // property didn't exist, user error.
}
function collate_(arr, fields) {
// Obtain the given property from all elements of the array.
const results = arr.map(function (element) {
return fields.slice().reduce(delve_, element);
});
return results;
}
Executing this yields the following output in Stackdriver:
Obviously you probably want some different (aka real) fields, and probably have other ideas for how to report them, so I leave that portion up to the reader.
Anyone with improvements to the above is welcome to submit a PR.
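For the PubMed case specifically, the hard-coded route (option 1) can be quite short. In the question's loop, jsonsum is overwritten while walking the pubdate path, so nothing is left to read the title from; reading every field from the same per-ID record avoids that. The following is only a sketch: it assumes the result/<PMID>/<field> shape used in the question, summaryToRows and FIELDS are names made up here, and the sample object is invented data, not a real PubMed response.

```javascript
// Fields to report, in column order. "pubdate" and "title" come from the question;
// any other field name would need checking against a real response.
const FIELDS = ["pubdate", "title"];

// Build one row per PubMed ID from a parsed esummary response.
// Assumed shape: { result: { "<pmid>": { pubdate: ..., title: ... } } }
function summaryToRows(summary, pmids, fields) {
  return pmids.map(function (pmid) {
    const record = (summary.result && summary.result[pmid]) || {};
    const cells = fields.map(function (field) {
      const value = record[field];
      if (value === undefined || value === null) return "";
      // Nested values (arrays, objects) are stringified so they fit in one cell.
      return typeof value === "object" ? JSON.stringify(value) : value;
    });
    return [pmid].concat(cells);
  });
}

// Invented sample in the assumed shape.
const sample = {
  result: {
    "111": { pubdate: "2019 Jan", title: "First made-up title" },
    "222": { pubdate: "2020 Mar", title: "Second made-up title" }
  }
};

console.log(summaryToRows(sample, ["111", "222"], FIELDS));
```

The returned 2D array has one row per ID, so it can be returned from a custom function or written with setValues in the same way as the output of the example above.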
Recommended Reading:
Array#reduce
Array#map
Array#splice
Array#slice
Internet references on parsing nested JSON. There are a lot.
I am importing data from a JSON file using Google Apps Script and Google Sheets. I have learned the basics on this, but the formatting on the JSON file I am attempting to parse is throwing me off.
What is confusing me is how I would search for information based on "name". Currently I am using this:
function JSONReq(url, xpath){
var res = UrlFetchApp.fetch(url);
var content = res.getContentText();
var json = JSON.parse(content);
var patharray = xpath.split("/");
for(var i = 0; i < patharray.length; i++){
json = json[patharray[i]];
}
return json;
}
I'm a bit lost now to be honest with you.
I want to have a cell where I can type a name that I already know, then find it in the JSON file and return that information however I decide to do it. I can pull and write to cells; I have the basics down. But I just can't understand how I could search by the name.
That JSON file is an array of objects. To find a specific object with a given "name", you would parse it into an object (which you do already), then iterate through them and check the name parameter:
var myName = "name of thing I want";
var arr = JSON.parse( ... );
for(var i = 0; i < arr.length; ++i) {
var obj = arr[i];
if(obj.name == myName) { // could be done as obj["name"] == ... too
// do stuff with obj
}
}
For your case, you might add additional arguments to your function (i.e. 2nd arg = the object's property, e.g. "name", with the 3rd = the desired value). This will be fine for any simple key-value properties, but would need specific handling where the value is itself an object (e.g. the "category" field in your specific JSON file).
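A minimal sketch of that idea, kept separate from the fetching so it can be tried on its own. findByProperty is a name made up for this example, the sample array is invented, and it uses strict equality (===) rather than the == shown above.

```javascript
// Return the first object in the array whose property equals the wanted value,
// or null if nothing matches. Only handles simple (non-object) property values.
function findByProperty(items, property, value) {
  for (let i = 0; i < items.length; i++) {
    if (items[i] && items[i][property] === value) {
      return items[i];
    }
  }
  return null;
}

// Invented sample data standing in for the parsed JSON file.
const things = [
  { name: "alpha", price: 10 },
  { name: "beta", price: 25 }
];

console.log(findByProperty(things, "name", "beta"));
```

In the sheet function, the array would come from JSON.parse on the fetched content, and the value to look for from the cell where the name is typed.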
https://github.com/guyonroche/exceljs
I'm new to exceljs and have just seen its description on GitHub, i.e. "Read, manipulate and write spreadsheet data and styles to XLSX and JSON."
I need to convert a workbook into a JSON object and did not find any method / module for it, like there is for CSV in exceljs.
Let me know if there is one.
I know this question is old but I am posting an answer for others who may be in the same position I am today.
Yes, there is.
I was looking for the same answer as you, without success at first. After a bit of searching, I found a way to do it.
import Exceljs from 'exceljs';
const workbook = new Exceljs.Workbook();
await workbook.xlsx.load(data);
const json = JSON.stringify(workbook.model);
console.log(json); // the json object
I combined the response to this issue and the exceljs README to figure it out. I hope this helps.
It's very simple to get values from an Excel file using exceljs.
// assumes `workbook` is an already loaded Exceljs.Workbook
const book = [];
workbook.eachSheet(worksheet => {
  const sheet = [];
  worksheet.eachRow(row => {
    sheet.push(row.values);
  });
  book.push(sheet);
});
But watch the array indexes: cells are numbered from 1, so index 0 of row.values is empty.
// data
let excelTitles = [];
let excelData = [];
// excel to json converter (only the first sheet)
workbook.worksheets[0].eachRow((row, rowNumber) => {
// rowNumber 0 is empty
if (rowNumber > 0) {
// get values from row
let rowValues = row.values;
// remove first element (extra without reason)
rowValues.shift();
// titles row
if (rowNumber === 1) excelTitles = rowValues;
// table data
else {
// create object with the titles and the row values (if any)
let rowObject = {}
for (let i = 0; i < excelTitles.length; i++) {
let title = excelTitles[i];
let value = rowValues[i] ? rowValues[i] : '';
rowObject[title] = value;
}
excelData.push(rowObject);
}
}
})
console.log(excelData);
return;
I was a user of the deprecated ScriptDB. The use I made of ScriptDB was fairly simple: to store a certain amount of information shown in an options panel, like this:
var db = ScriptDb.getMyDb();
function showList(folderID) {
var folder = DocsList.getFolderById(folderID);
var files = folder.getFiles();
var arrayList = [];
for (var file in files) {
file = files[file];
var thesesName = file.getName();
var thesesId = file.getId();
var thesesDoc = DocumentApp.openById(thesesId);
for (var child = 0; child < thesesDoc.getNumChildren(); child++){
var thesesFirstParagraph = thesesDoc.getChild(child);
var thesesType = thesesFirstParagraph.getText();
if (thesesType != ''){
var newArray = [thesesName, thesesType, thesesId];
arrayList.push(newArray);
break;
}
}
}
arrayList.sort();
var result = db.query({arrayName: 'savedArray'});
if (result.hasNext()) {
var savedArray = result.next();
savedArray.arrayValue = arrayList;
db.save(savedArray);
}
else {
var record = db.save({arrayName: "savedArray", arrayValue:arrayList});
}
var mydoc = SpreadsheetApp.getActiveSpreadsheet();
var app = UiApp.createApplication().setWidth(550).setHeight(450);
var panel = app.createVerticalPanel()
.setId('panel');
var label = app.createLabel("Choose the options").setStyleAttribute("fontSize", 18);
app.add(label);
panel.add(app.createHidden('checkbox_total', arrayList.length));
for(var i = 0; i < arrayList.length; i++){
var checkbox = app.createCheckBox().setName('checkbox_isChecked_'+i).setText(arrayList[i][0]);
panel.add(checkbox);
}
var handler = app.createServerHandler('submit').addCallbackElement(panel);
panel.add(app.createButton('Submit', handler));
var scroll = app.createScrollPanel().setPixelSize(500, 400);
scroll.add(panel);
app.add(scroll);
mydoc.show(app);
}
function include(arr, obj) {
for(var i=0; i<arr.length; i++) {
if (arr[i] == obj) // if we find a match, return true
return true; }
return false; // if we got here, there was no match, so return false
}
function submit(e){
var scriptDbObject = db.query({arrayName: "savedArray"});
var result = scriptDbObject.next();
var arrayList = result.arrayValue;
db.remove(result);
// continues...
}
I thought I could simply replace ScriptDB with userProperties (using JSON to turn the array into a string). However, an error warns me that my data is too large to be stored in userProperties.
I did not want to use external databases (parse or MongoDB), because I think it isn't necessary for my (simple) purpose.
So, what solution could I use as a replacement for ScriptDB?
You could store a string using the HtmlOutput Class.
var output = HtmlService.createHtmlOutput('<b>Hello, world!</b>');
output.append('<p>Hello again, world.</p>');
Logger.log(output.getContent());
Google Documentation - HtmlOutput
There are methods to append, clear and get the content out of the HtmlOutput object.
OR
Maybe create a Blob:
Google Documentation - Utilities Class - newBlob Method
Then you can get the data out of the blob as a string.
getDataAsString
Then if you need to you can convert the string to an object if it's in the right JSON format.
Firstly, if you're hitting the limits on the Properties service, I would recommend you look at an alternative external store, as you're manipulating a large amount of data, and any workaround given here is possibly going to be slower and less efficient than simply using a dedicated service.
Alternatively of course, you could look at making your data come under the limits for the properties service by splitting it up and using multiple properties etc.
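As a rough sketch of that splitting idea: serialise the array once, cut the string into fixed-size pieces, and store each piece under its own key plus a count. Here a plain object stands in for the property store so the logic can be tried anywhere; in Apps Script the reads and writes would go through getProperty and setProperty on a Properties object instead. The function names and the key naming scheme are made up for this example, and the chunk size is a parameter because the right value depends on the per-value limit you are hitting.

```javascript
// Cut a string into pieces of at most chunkSize characters.
function splitIntoChunks(text, chunkSize) {
  const chunks = [];
  for (let i = 0; i < text.length; i += chunkSize) {
    chunks.push(text.slice(i, i + chunkSize));
  }
  return chunks;
}

// Store a value as JSON spread over several keys: key_count, key_0, key_1, ...
// Note: chunks left over from an earlier, longer value are not cleaned up here.
function saveChunked(store, key, value, chunkSize) {
  const chunks = splitIntoChunks(JSON.stringify(value), chunkSize);
  store[key + "_count"] = String(chunks.length);
  chunks.forEach(function (chunk, i) {
    store[key + "_" + i] = chunk;
  });
}

// Reassemble the pieces and parse them back; returns null if nothing was stored.
function loadChunked(store, key) {
  const count = parseInt(store[key + "_count"] || "0", 10);
  if (count === 0) return null;
  let text = "";
  for (let i = 0; i < count; i++) {
    text += store[key + "_" + i];
  }
  return JSON.parse(text);
}

// Plain object standing in for the property store; the tiny chunk size is for demonstration only.
const demoStore = {};
saveChunked(demoStore, "savedArray", [["Name A", "Type A", "id-a"]], 8);
console.log(loadChunked(demoStore, "savedArray"));
```

The chunk size here counts characters, so if the limit is measured in bytes it is safer to leave some headroom, and the total size of the store is still capped regardless of how the data is split.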
One other alternative would be to use a Google Doc or Sheet to store the string. When you're required to pull the data again, you can simply access the sheet and get the string, but this might be slow depending on the size of the string. At a glance it looks like you're just pulling Data on the folders in your drive, so you could consider writing it to a sheet, which would allow you to even display the information in a user friendly way. Given your use of arrays already, you can write them to a sheet easily using .setValues() if you convert them to a 2D array.
Bruce McPherson has done a lot of work on abstracting databases. Take a look at his cDbAbstraction library then you could easily chop and change which DB you use and compare performance. Maybe even create a cDbAbstraction library to use HTMLOutput (I like that idea Sandy, Bruce does some funky stuff with parallel processes via HTMLService)