Where is word/_rels/document.xml.rels in a python-docx Document object?

I need the content of word/_rels/document.xml.rels to get the image information. Does python-docx store it?
I use this:
>>> from docx import Document as d
>>> x = d('a.docx')
There seems to be no way to get it from the x object.

python-docx and python-pptx share a common opc subpackage; this is the docx.opc subpackage.
This layer abstracts the details of the .rels files, among other things.
You can get to it using:
>>> document = Document()
>>> document_part = document.part
>>> rels = document_part.rels
>>> for rel in rels.values():
...     print(rel.rId)
rId2
rId1
rId3
How you use it most effectively depends on what you're trying to get at. Usually one just wants to get a related part and doesn't care about navigating the details of the packaging. For that there are these higher level methods:
docx.opc.part.Part.part_related_by()
docx.opc.part.Part.related_parts[rId]
In general, the route from the object at hand is:
to the part it's contained in (often available on obj.part)
to the related part by use of .part_related_by() (using a relationship type) or .related_parts[rId] (it's a dict)
back down to the API object via X_Part.main_obj, e.g. DocumentPart.document (see the sketch below)
The areas in the code you might be interested in looking closer at are:
docx/parts/
docx/opc/part.py
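For example, here is a minimal sketch of walking the relationships of the main document part to reach the image parts, assuming a.docx actually contains images (rel.reltype, rel.target_part, and Part.blob are attributes of the docx.opc layer):
import itertools  # not required; just here to emphasize this is plain iteration
from docx import Document
from docx.opc.constants import RELATIONSHIP_TYPE as RT

document = Document('a.docx')
# rels is the dict-like collection backed by word/_rels/document.xml.rels
for rel in document.part.rels.values():
    if rel.reltype == RT.IMAGE:
        image_part = rel.target_part
        print(rel.rId, image_part.partname, len(image_part.blob))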


Embedding Altair HTMLs in Google Sites: how to `mark_image` using private Google Drive links?

I am trying to embed an interactive plot made using Altair into a Google Site. In this plot, I want to interactively display one image at a time, where the images are stored on Google Drive. When I gave this an attempt, mark_image failed silently, presumably because it could not read the image. This is not a surprise, because the Google Drive images were private. With publicly shared images I wouldn't have this issue, but for the purpose of this plot I would like to keep the images private. Also, there are a lot of images in total (~1K), so I probably should not encode them as data/bytes; I suspect that would make my HTML file very big and slow. Please correct me if I am wrong on this.
I wonder if mark_image could read the images from the Google Drive links, perhaps using a "reader" of some sort (an upstream JS or Python library) that would then feed the read image to mark_image. If anybody has experience with this, solutions/suggestions/workarounds would be greatly helpful.
Here's a demo code to test this:
Case 1: Publicly accessible image (no problem). Displayed using mark_image, saved in HTML format.
import altair as alt
import pandas as pd

path = "https://vega.github.io/vega-datasets/data/gimp.png"
source = pd.DataFrame([{"x": 0, "y": 0, "img": path}])
chart = alt.Chart(source).mark_image(width=100, height=100).encode(x='x', y='y', url='img')
chart.save('test.html')
Then I embed the HTML in a Google Site (private, not shared to the public) using this option, pasting the content of the HTML file into the Embed code tab.
Case 2: Image on Google Drive (problem!). The case of an image stored on Google Drive (private, not shared to the public).
# Please use the code above, with the `path` variable generated like this:
file_id = ''  # google drive file id
path = f"https://drive.google.com/uc?export=view&id={file_id}"
In this case, mark_image apparently fails silently and the image is not shown on the plot.
After searching for an optimal solution, I decided to rely on a workaround: encoding the images as data/bytes. This eliminates the issue of reading the URLs from Google Drive, for which I could not find a solution.
Encoding the images as data/bytes, as I suspected, made the HTML big in size, but surprisingly (to me) not slow to load at all. I guess that's the best I could do for what I wanted.
In the example below, the get_data function obtains the data/bytes of an image. I put that into a column of the dataframe that Altair takes as input.
def plot_(images_from):
    import altair as alt
    import pandas as pd
    import numpy as np
    np.random.seed(0)
    n_objects = 20
    n_times = 50
    # Create one (x, y) pair of metadata per object
    locations = pd.DataFrame({
        'id': range(n_objects),
        'x': np.random.randn(n_objects),
        'y': np.random.randn(n_objects)
    })
    def get_data(p):
        import base64
        with open(p, "rb") as f:
            return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()
    import urllib.request
    if images_from == 'url':
        l1 = [f"https://vega.github.io/vega-datasets/data/{k}.png" for k in ['ffox', '7zip', 'gimp']]
    elif images_from == 'data':
        l1 = [get_data(urllib.request.urlretrieve(f"https://vega.github.io/vega-datasets/data/{k}.png", f'/tmp/{k}.png')[0]) for k in ['ffox', '7zip', 'gimp']]
    np.random.seed(0)
    locations['img'] = np.random.choice(l1, size=len(locations))
    # Create a 50-element time series for each object
    timeseries = pd.DataFrame(np.random.randn(n_times, n_objects).cumsum(0),
                              columns=locations['id'],
                              index=pd.RangeIndex(0, n_times, name='time'))
    # Melt the wide-form time series into a long-form view
    timeseries = timeseries.reset_index().melt('time')
    # Merge the (x, y) metadata into the long-form view
    timeseries['id'] = timeseries['id'].astype(int)  # make merge not complain
    data = pd.merge(timeseries, locations, on='id')
    # Data is prepared, now make a chart
    selector = alt.selection_single(empty='none', fields=['id'])
    base = alt.Chart(data).properties(
        width=250,
        height=250
    ).add_selection(selector)
    points = base.mark_point(filled=True, size=200).encode(
        x='mean(x)',
        y='mean(y)',
        color=alt.condition(selector, 'id:O', alt.value('lightgray'), legend=None),
    )
    timeseries = base.mark_line().encode(
        x='time',
        y=alt.Y('value', scale=alt.Scale(domain=(-15, 15))),
        color=alt.Color('id:O', legend=None)
    ).transform_filter(
        selector
    )
    images = base.mark_image(filled=True, size=200).encode(
        x='x',
        y='y',
        url='img',
    ).transform_filter(
        selector
    )
    chart = points | timeseries | images
    chart.save(f'test/chart_images_{images_from}.html')

# generate HTMLs
plot_(images_from='url')   # generate the HTML using URLs
plot_(images_from='data')  # generate the HTML using data/bytes
The HTML made using the data was ~78 times bigger than the one made using URLs (~12 MB vs ~0.16 MB), but not noticeably slower to load.
Update: As I later found out, Google Sites does not allow embedding an HTML file larger than 1 MB. So in the end, encoding the images did not really help.

Data frame error when converting iGraph to gexf object

I am trying to convert an iGraph object to a gexf object using the rgexf package so that I can write a file usable with Gephi, which I prefer for network visualization.
My iGraph object is created by reading in two CSVs: h.edges and h.nodes. There are both edge and node attributes. Once the files are read in, I create the iGraph object, calculate centrality measures and then attach the centrality measures as node attributes. The code looks like so:
iNet = graph_from_data_frame(d=h.edges, vertices = h.nodes, directed = F)
V(iNet)$degree = degree(iNet)
V(iNet)$eig = evcent(iNet)$vector
V(iNet)$betweenness = betweenness(iNet)
This appears to be working fine since I can do all the normal iGraph functions -- plot, calculate centralities, identify communities, etc. My problem comes when I try to convert this to a gexf object. I run the following code:
library(rgexf)
iNet.gexf <- igraph.to.gexf(iNet)
But get the below error message:
Error in `[.data.frame`(x, r, vars, drop = drop) :
undefined columns selected
Anyone know what's happening? Although I know the example here can all be done just by uploading the two CSVs straight to Gephi and running the calculations there, the end goal is to be able to attach iGraph's more robust calculations as attributes in ways that Gephi can't.

how to find centrality of nodes within clusters using igraph and python

I'm working on network analysis and I'm new to Python. I want to find the centrality of every node within a cluster using igraph and Python pandas.
I have tried the following:
Creating a graph:
import igraph

tuples = [tuple(x) for x in data.values]
g = igraph.Graph.TupleList(tuples, directed=False, weights=True)
community detection using fast greedy algorithm:
fm = g.community_fastgreedy()
fm1 = fm.as_clustering()
Clusters like this are formed:
[1549] 96650006, 966543799, 966500080
[1401] 96650006, 966567865, 966500069, 966500071
Now, I would like to get the eigenvector centrality for each number within a cluster, so that I know which is the most important number within a cluster.
I am not very familiar with eigenvector centrality in igraph, but here is the solution I came up with:
# initial code is the same as yours
import numpy as np

# iterate over all of the created subgraphs:
for subgraph in fm1.subgraphs():
    # this is basically already what you want
    cents = subgraph.eigenvector_centrality()
    # additionally, get the index of the maximum value
    max_idx = np.argmax(cents)
    print(subgraph.vs[max_idx])  # gets the correct vertex element
Essentially, you want to use the option of accessing the created clusters as graphs (.subgraphs() allows you to do exactly that). The rest is then "just" simple manipulation of each graph object to get the element with the maximum eigenvector centrality.
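Since the question mentions pandas, here is a hedged extension of the same idea that collects every vertex's centrality per cluster into a DataFrame, assuming the vertices carry the 'name' attribute that Graph.TupleList sets from your tuples:
import numpy as np
import pandas as pd

rows = []
for cluster_id, subgraph in enumerate(fm1.subgraphs()):
    cents = subgraph.eigenvector_centrality()
    # pair each vertex of the subgraph with its centrality score
    for vertex, cent in zip(subgraph.vs, cents):
        rows.append({'cluster': cluster_id, 'name': vertex['name'], 'centrality': cent})

df = pd.DataFrame(rows)
# the most important number within each cluster
print(df.loc[df.groupby('cluster')['centrality'].idxmax()])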

How to add w:altChunk and its relationship with python-docx

I have a use case that makes use of the <w:altChunk/> element in a Word document, by injecting (fragments of) HTML files as alternate chunks and letting Word do the work when the file gets opened. The current implementation uses XML/XSL to compose the WordprocessingML, modify the relationships, and do all the packaging manually, which is a real pain.
I wanted to move to python-docx, but the API doesn't support this directly. I found a way to add the <w:altChunk/> element to the document XML, but I'm still struggling to find a way to add the relationship and the related file to the package.
I think I should make a compatible part and pass it to the document.part.relate_to function to do its job, but I still can't figure out how:
from docx import Document
from docx.oxml import OxmlElement
from docx.oxml.ns import qn
from docx.opc.constants import RELATIONSHIP_TYPE as RT

def add_alt_chunk(doc: Document, chunk_part):
    '''TODO: figure out how to add files and relationships'''
    r_id = doc.part.relate_to(chunk_part, RT.A_F_CHUNK)
    alt = OxmlElement('w:altChunk')
    alt.set(qn('r:id'), r_id)
    doc.element.body.sectPr.addprevious(alt)
Update:
As per scanny's advice, below is my working code. Thank you very much, Steve!
from docx import Document
from docx.oxml import OxmlElement
from docx.oxml.ns import qn
from docx.opc.part import Part
from docx.opc.constants import RELATIONSHIP_TYPE as RT

def add_alt_chunk(doc: Document, html: str):
    package = doc.part.package
    partname = package.next_partname('/word/altChunk%d.html')
    alt_part = Part(partname, 'text/html', html.encode(), package)
    r_id = doc.part.relate_to(alt_part, RT.A_F_CHUNK)
    alt_chunk = OxmlElement('w:altChunk')
    alt_chunk.set(qn('r:id'), r_id)
    doc.element.body.sectPr.addprevious(alt_chunk)

doc = Document()
doc.add_paragraph('Hello')
add_alt_chunk(doc, "<body><strong>I'm an altChunk</strong></body>")
doc.add_paragraph('Have a nice day!')
doc.save('test.docx')
Note: the altChunk parts only work/appear when the document is opened using MS Word.
Well, some hints here anyway. Maybe you can post your working code at the end as a full "answer":
The alt-chunk part needs to start its life as a docx.opc.part.Part object.
The blob argument should be the bytes of the file, which is often but not always plain text. It must be bytes though, not unicode (characters), so any encoding has to happen before calling Part().
I expect you can work out the other arguments:
package is the overall OPC package, available on document.part.package.
You can use docx.opc.package.OpcPackage.next_partname() to get an available partname based on a root template like "altChunk%s", for a name like "altChunk3". Check what partname prefix Word uses for these, possibly with unzip -l has-an-alt-chunk.docx; it should be easy to spot.
The content type is one in docx.opc.constants.CONTENT_TYPE. Check the [Content_Types].xml part in a .docx file that has an altChunk to see which one they use.
Once formed, the document_part.relate_to() method will create the proper relationship. If there is more than one relationship (not common), then you need to create each one separately. There would only be one relationship from a particular part; it's just that some parts are related to more than one other part. Check the relationships in an existing .docx to see, but it's a pretty good guess that it's only the one in this case.
So your code would look something like:
package = document.part.package
partname = package.next_partname("altChunkySomethingPrefix")
content_type = docx.opc.constants.CONTENT_TYPE.THE_RIGHT_MIME_TYPE
blob = make_the_altChunk_file_bytes()
alt_chunk_part = Part(partname, content_type, blob, package)
rId = document.part.relate_to(alt_chunk_part, RT.A_F_CHUNK)
etc.

What is the difference between EmbeddedDocumentField and ReferenceField in mongoengine

Internally, what are the differences between these two fields? What kind of schema do these fields map to in mongo? Also, how should documents with relations be added to these fields? For example, if I use
from mongoengine import *

class User(Document):
    name = StringField()

class Comment(EmbeddedDocument):
    text = StringField()
    tag = StringField()

class Post(Document):
    title = StringField()
    author = ReferenceField(User)
    comments = ListField(EmbeddedDocumentField(Comment))
and call
>>> some_author = User.objects.get(name="ExampleUserName")
>>> post = Post.objects.get(author=some_author)
>>> post.comments
[]
>>> comment = Comment(text="cool post", tag="django")
>>> comment.save()
>>>
should I use post.comments.append(comment) or post.comments += comment to append this document? My original question stems from this confusion about how I should handle this.
An EmbeddedDocumentField is just part of the parent document, like a DictField, and is stored in one record with the parent document in Mongo.
To save an EmbeddedDocument, just save the parent document:
>>> some_author = User.objects.get(name="ExampleUserName")
>>> post = Post.objects.get(author=some_author)
>>> post.comments
[]
>>> comment = Comment(text="cool post", tag="django")
>>> post.comments.append(comment)
>>> post.save()
>>> post.comments
[<Comment object __unicode__>]
>>> Post.objects.get(author=some_author).comments
[<Comment object __unicode__>]
See documentation: http://docs.mongoengine.org/guide/defining-documents.html#embedded-documents.
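To make the schema difference concrete, here is a hedged sketch of roughly what the stored records could look like in Mongo for the models above (the values are illustrative placeholders, not real output): the ReferenceField stores only a reference to the User document, which mongoengine dereferences on access, while the embedded comments live inline in the Post record.
# Illustrative shapes of the stored records (values are placeholders):
user_record = {
    "_id": "ObjectId(...)",
    "name": "ExampleUserName",
}
post_record = {
    "_id": "ObjectId(...)",
    "title": "Some title",
    "author": "ObjectId(...)",  # ReferenceField: only the User's id is stored
    "comments": [               # embedded documents: stored inline with the Post
        {"text": "cool post", "tag": "django"},
    ],
}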
This is just a sample case where we can use embedded docs.
Let's say, for example, you are going to create an app that takes in requirements as they come in and saves them in the db.
Now your requirement is to assign this requirement to a bunch of people, each at a later stage, after some processing of the requirement.
You also need to track the changes and log the activity pertaining to the processing of the requirement.
You might say we can use an RDBMS kind of relationship with a ReferenceField, but that involves taking care of deleting obsolete records in either collection, which is nothing but extra code to handle the maintenance of the child collection in case the parent doc is deleted (there are other extra efforts that come into play too).
Instead, embedded documents are stored as part of the parent doc, so maintaining the parent takes care of the embedded docs too.
And it is easier to create complex JSON-structured data using embedded docs than to write user logic to manipulate and process the data into a complex structure.
Here the relation is one requirement to many handlers (which is nothing but an activity log by the handlers for the one requirement).