Issue with Apache POI when converting .docx to a JSON document format

I am currently parsing a 26-page .docx with images, tables, italics, and underlines. I am able to clear
Using Apache POI, I load the file as an XWPFDocument, which gives me a list of XWPFParagraph objects. When I iterate through the paragraphs, I am not able to get the styles (italic, underline, bold) of individual pieces of text when a single paragraph contains several styles.
I have tried XWPFParagraph.getRuns() and then XWPFRun.getFontFamily(), but I get null. I do get data at the paragraph level when I call XWPFParagraph.getStyle().
Please let me know if you have encountered similar issues.

I hope this code helps. You can read some of the style information from the run's CTRPr object.
// Low-level run properties; null when the run carries no direct formatting
CTRPr rPr = run.getCTR().getRPr();
if (rPr != null) {
    CTFonts rFonts = rPr.getRFonts();
    if (rFonts != null) {
        String eastAsia = rFonts.getEastAsia();
        String hAnsi = rFonts.getHAnsi();
        Enum hAnsiTheme = rFonts.getHAnsiTheme();
    }
}

Need to get the 2sxc Field Type from an Entity

After getting everything working in my previous question, "Is there a way to Clone one or many Entities (records) in Code", I wanted to clean it up and make it more useful and reusable. So far, I decide how to copy/add each field using the content-type field names, that is, attribute.Key inside the foreach on Attributes. What I need instead is the Entity field's Type: String, Number, Hyperlink, Entity, etc.
So I want something like if(AsEntity(original).FieldType == "HyperLink") { do this stuff }. I have explored the API docs but have not spotted how to get to the info. Is it possible?
I did figure out that attribute.Value has a Type that I could use to distinguish most of them, but Hyperlink and String both show up as System.String.
Here are, in order, String, Hyperlink, Entity, and Number:
atts: ToSic.Eav.Data.Attribute`1[System.String]
atts: ToSic.Eav.Data.Attribute`1[System.String]
atts: ToSic.Eav.Data.Attribute`1[ToSic.Eav.Data.EntityRelationship]
atts: ToSic.Eav.Data.Attribute`1[System.Nullable`1[System.Decimal]]
So is there a way from the Entity or its Attributes or some other pathway of object/methods/properties to just get the answer as the field Type name? Or is there a wrapper of some kind I can get to that will let me handle (convert to/from) Hyperlinks? I am open to other ideas. Since the fields.Add() is different by "FieldType" this would be really helpful.
It's fairly simple, but needs a bit more code because of the dynamic nature of Razor. Here's some sample code that should get you what you need:
@using System.Collections.Generic;
@using System.Linq;
@using ToSic.Eav.Data;
@{
    var type = AsEntity(Content).Type;
    var attributes = type.Attributes as IEnumerable<IContentTypeAttribute>;
    var typeOfAwards = attributes.First(t => t.Name == "Awards").Type; // this will return "Entity"
}
I created a quick sample for you here: https://2sxc.org/dnn-tutorials/en/razor/data910/page

How to scrape text based on a specific link with BeautifulSoup?

I'm trying to scrape text from a website, but specifically only the text that's linked to with one of two specific links, and then additionally scrape another text string that follows shortly after it.
The second text string is easy to scrape because it includes a unique class I can target, so I've already gotten that working, but I haven't been able to successfully scrape the first text (with the one of two specific links).
I found this SO question ( Find specific link w/ beautifulsoup ) and tried to implement variations of that, but wasn't able to get it to work.
Here's a snippet of the HTML code I'm trying to scrape. This pattern recurs repeatedly over the course of each page I'm scraping:
<em>[女孩]</em> 寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲 黄冬冬289179
The two parts I'm trying to scrape and then store together in a list are the two Chinese-language text strings.
The first of these, 女孩, which means female, is the one I haven't been able to scrape successfully.
This is always preceded by one of these two links:
forum.php?mod=forumdisplay&fid=191&filter=typeid&typeid=19 (Female)
forum.php?mod=forumdisplay&fid=191&filter=typeid&typeid=15 (Male)
I've tested a whole bunch of different things, including things like:
gender_containers = soup.find_all('a', href = 'forum.php?mod=forumdisplay&fid=191&filter=typeid&typeid=19')
print(gender_containers.get_text())
But for everything I've tried, I keep getting errors like:
ResultSet object has no attribute 'get_text'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
I think that I'm not successfully finding those links to grab the text, but my rudimentary Python skills thus far have failed me in figuring out how to make it happen.
What I want to have happen ultimately is to scrape each page such that the two strings in this code (女孩 and 寻找2003年出生2004年失踪贵州省...)
<em>[女孩]</em> 寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲 黄冬冬289179
...are scraped as two separate variables so that I can store them as two items in a list and then iterate down to the next instance of this code, scrape those two text snippets and store them as another list, etc. I'm building a list of lists in which I want each row/nested list to contain two strings: the gender (女孩 or 男孩) and then the longer string, which has a lot more variation.
(But currently I have working code that scrapes and stores that, I just haven't been able to get the gender part to work.)
It sounds like you could use a CSS attribute = value selector with the $ ("ends with") operator.
If there can only be one occurrence per page:
soup.select_one("[href$='typeid=19'], [href$='typeid=15']").text
This is assuming those typeid=19 or typeid=15 only occur at the end of the strings of interest. The "," between the two in the selector is to allow for matching on either.
You could additionally handle possibility of not being present as follows:
from bs4 import BeautifulSoup
html ='''<em>[女孩]</em> 寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲 黄冬冬289179'''
soup=BeautifulSoup(html,'html.parser')
# Look the link up once, then fall back to a default when it is missing
match = soup.select_one("[href$='typeid=19'], [href$='typeid=15']")
gender = match.text if match is not None else 'Not found'
print(gender)
Multiple values (note select rather than select_one, so that every match is returned):
genders = [item.text for item in soup.select("[href$='typeid=19'], [href$='typeid=15']")]
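To get the list of lists described in the question, the same selector can be combined with a lookup of the title that follows each gender link. This is only a sketch: the real page markup is not shown in full, so the `<a>` elements, the `xst` class on the title link, the `thread-*.html` URLs, and the second row's title are all assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Assumed markup: the <a> tags, the "xst" class and the second row are
# hypothetical stand-ins for the real page structure.
html = '''
<div>
  <em>[<a href="forum.php?mod=forumdisplay&amp;fid=191&amp;filter=typeid&amp;typeid=19">女孩</a>]</em>
  <a class="xst" href="thread-1.html">寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲 黄冬冬289179</a>
</div>
<div>
  <em>[<a href="forum.php?mod=forumdisplay&amp;fid=191&amp;filter=typeid&amp;typeid=15">男孩</a>]</em>
  <a class="xst" href="thread-2.html">(hypothetical second title)</a>
</div>
'''

soup = BeautifulSoup(html, "html.parser")

# Match links whose href ends with either type id.
selector = "a[href$='typeid=19'], a[href$='typeid=15']"

rows = []
for gender_link in soup.select(selector):
    # Assumption: the title is the next <a class="xst"> after the gender link.
    title_link = gender_link.find_next("a", class_="xst")
    title = title_link.get_text(strip=True) if title_link else "Not found"
    rows.append([gender_link.get_text(strip=True), title])

for row in rows:
    print(row)
```

Each nested list holds the gender first and the longer string second. If the title on the real page is marked up differently, only the `find_next` call should need to change.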
Try the following code.
from bs4 import BeautifulSoup
data='''<em>[女孩]</em> 寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲 黄冬冬289179'''
soup=BeautifulSoup(data,'html.parser')
print(soup.select_one('em').text)
Output:
[女孩]

Using JSON.stringify but SSJS variant in XPages

For an application, I am building an administration panel where a power user should be able to check the JSON structure of a selected object.
I would like to display the JSON object in a computed text field, formatted so that it is easier for a human to read, similar to pretty print.
Is there any function I could use in SSJS that produces something similar, so I can display JSON nicely in computed text / editable fields?
Use stringify's third parameter "space":
JSON.stringify(yourObject, null, ' ');
space
A String or Number object that's used to insert white
space into the output JSON string for readability purposes. If this is
a Number, it indicates the number of space characters to use as white
space; this number is capped at 10 if it's larger than that. Values
less than 1 indicate that no space should be used. If this is a
String, the string (or the first 10 characters of the string, if it's
longer than that) is used as white space. If this parameter is not
provided (or is null), no white space is used.
As XPages doesn't support JSON.stringify yet, you can include JSON's definition as an SSJS resource and use it.
As Knut points out, you can certainly add json2.js to XPages; I've previously used an implementation as Marky Roden's post outlines. This is probably the "safest" way of doing so, from the SSJS side of things.
It does ignore the included fromJson and toJson SSJS methods provided out of the box in XPages. While imperfect, they are functional, especially with the inclusion of Tommy Valand's fix snippet. Be advised, using Tommy's fix does wrap responses to ensure a proper JS object can be parsed by shoving an Array into an object with a values property for the array; so no direct pulling of an Array only.
Additionally, I believe it would be useful to point out that a bean, providing a convenience method or two as wrappers to use either the com.ibm.commons.util.io.json methods to abstract the conversion method, or switching in something like Google GSON, might be more powerful and unified, based on your style of development.
Knut, Eric, I had already gotten this far myself.
function prettyPrint(id) {
    var ugly = dojo.byId(id).value;
    var obj = $.parseJSON( "[" + ugly + "]" );
    var pretty = JSON.stringify(obj, undefined, 4);
    dojo.byId(id).innerHTML = pretty;
}
and I call it like this, for example:
var name = x$('#{id:input-currentObjectCollectionFiltered}').attr("name");
prettyPrint(name);
I tried to make use of the x$ function but was not able to make the ID dynamic there, e.g.
var ugly = x$('#{id:" + id + "}').val();
I'm not sure why. It would be nicer if I could just call prettyPrint('input-currentObjectCollectionFiltered'); and let the function figure it out.
Instead of dojo.byId(id).value I tried:
var ugly=$("#" + id).val();
but this returns an undefined object. I thought jQuery would be smarter about working with dynamic IDs.
Anyway, stringify works just fine.

How to know if Perl Mojo::DOM::find returns or matches any DOM Element

I am currently parsing a series of web pages with Mojo::DOM, and the only criterion for me to proceed down the web page is whether a particular element is found within it.
I have my DOM object built like this:
my $urlMJ = Mojo::URL->new($entry->link);
my $tx = $ua->get($urlMJ);
my $base = $tx->req->url;
my $dom = $tx->res->dom;
my $divVideo = $dom->find('div#searchforme');
My question is, how do I know if $divVideo is empty?
I realise that from this question on google groups and grokbase answered by SRI (Riedel), if find doesn't match any element, it returns (if I get it correctly) the DOM object collection initiating the find and an empty DOM collection, which happens to be the result.
I thought of using an each to get to the empty DOM collection within, but won't the DOM returned contain the initial DOM structure?
I have tried using if (defined($divVideo)) , I also tried dumping with print Dumper($divVideo). All it returned was $VAR1 = bless( [], 'Mojo::Collection' );
I tried $dom->find('div#searchforme')->size; the return value was 0, even for those web pages that didn't fall into this category.
Can somebody please help me out?
Is my approach to this wrong?
if find doesn't match any element, it returns (if I get it correctly) the DOM object collection initiating the find and an empty DOM collection, which happens to be the result.
You're misunderstanding find. It returns just a Mojo::Collection of Mojo::DOM objects that represent each matching element in the page. Nothing else. So if no matches are found, just an empty collection is returned.
This object has a size method, so you can say:
my $divColln = $dom->find('div#searchforme');
if ( $divColln->size > 0 ) {
    ...
}
Alternatively, you could use the each method to convert the collection into a list, and assign it to an array like this:
my @divColln = $dom->find('div#searchforme')->each;
if ( @divColln ) {
    ...
}
Or, if you are expecting to find just one such element (which it looks like you're doing here), you can just pick the first item from the collection, like this:
my $divVideo = $dom->find('div#searchforme')->[0];
if ( $divVideo ) {
    ...
}

Read all images in one column, include the base64 string result to image html tag. XPage

As you can see, the code below translates the image data from a Lotus database into a base64 string. The problem is that I enter the file name of the image manually (line 4). I have lots of images in my database, and only "btnbg.jpg" is read; the others are not. How can my code read all the image file names inside the database column? Also, how can I include the resulting base64 string in my HTML image tag? Thank you so much and God bless.
var testView:NotesView = database.getView("uploadforms");
var col:NotesDocumentCollection = testView.getAllDocumentsByKey("1");
var testDoc:NotesDocument = col.getFirstDocument();
var attachment:NotesEmbeddedObject = testDoc.getAttachment("btnbg.jpg");
var input:java.io.InputStream = attachment.getInputStream();
var base64Enc = new sun.misc.BASE64Encoder();
var output = new java.io.ByteArrayOutputStream();
base64Enc.encode( input, output );
return output.toString();
"How can my code can read all the image file names inside the database column"
You need to print the attachment names into the column, for example with the help of the "@AttachmentNames" function.
You need to use the "ViewNavigator" class to traverse the column.
If you prefer to work with documents, use one of the methods that return all attachments of a document, such as the "EmbeddedObjects" method on the document and on its rich-text items.
"how can I include the result base64 string to my html image tag"
You could do it with the help of CSS: background:url(data:image/jpeg;base64,...
It's a bad idea to embed a lot of pictures as base64 in CSS, since base64 encoding makes the data about a third larger.
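The same data-URI form also works directly in the src attribute of an img tag. The sketch below uses Python only to illustrate the string format; it is not XPages code. The helper name `to_img_tag` and the sample bytes are hypothetical. In the SSJS code from the question, the string returned by output.toString() would take the place of `encoded`.

```python
import base64


def to_img_tag(image_bytes: bytes, mime_type: str = "image/jpeg") -> str:
    """Return an <img> tag whose src is a base64 data URI (hypothetical helper)."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f'<img src="data:{mime_type};base64,{encoded}" />'


# Hypothetical stand-in for attachment contents: file name -> raw bytes.
attachments = {
    "btnbg.jpg": b"\xff\xd8\xff\xe0",
    "logo.jpg": b"\xff\xd8\xff\xe1",
}

# One tag per attachment, so no file name has to be hard-coded.
tags = [to_img_tag(data) for data in attachments.values()]
print("\n".join(tags))
```

Looping over every attachment, rather than naming one file, is what replaces the hard-coded "btnbg.jpg" lookup.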