Embedding Altair HTML in Google Sites: how to `mark_image` using private Google Drive links? - google-drive-api

I am trying to embed an interactive plot made using Altair into a Google Site. In this plot, I want to interactively display one image at a time, where the images are stored on Google Drive. When I gave this a try, mark_image failed silently, presumably because it could not read the image. That is not a surprise, because the Google Drive images are private. With publicly shared images I would not have this issue, but for the purposes of this plot I would like to keep the images private. Also, there are a lot of images in total (~1K), so I probably should not encode them as data/bytes; I suspect that would make my HTML file very big and slow. Please correct me if I am wrong on this.
I wonder if mark_image could read the images from the Google Drive links, perhaps via a "reader" of some sort (an upstream JS or Python library) that feeds the read image to mark_image. If anybody has experience with this, solutions/suggestions/workarounds would be greatly appreciated.
Here's a demo code to test this:
Case 1: Publicly accessible image (no problem). Displayed using mark_image and saved in HTML format.
import altair as alt
import pandas as pd

path = "https://vega.github.io/vega-datasets/data/gimp.png"
source = pd.DataFrame([{"x": 0, "y": 0, "img": path}])
chart = alt.Chart(source).mark_image(width=100, height=100).encode(x='x', y='y', url='img')
chart.save('test.html')
Then I embed the HTML in a Google Site (private, not shared to the public) using the Embed option, pasting the content of the HTML file in the Embed code tab.
Case 2: Image on Google Drive (problem!). The same setup, but with an image stored on Google Drive (private, not shared to the public).
# Please use the code above, with the `path` variable generated like this:
file_id = ''  # Google Drive file id
path = f"https://drive.google.com/uc?export=view&id={file_id}"
In this case, mark_image apparently fails silently and the image is not shown on the plot.

After searching for an optimal solution, I decided to rely on the workaround of encoding the images as data/bytes. This eliminates the issue of reading the URLs from Google Drive, for which I could not find a solution.
Encoding the images as data/bytes, as I suspected, made the HTML big in size, but surprisingly (to me) not slow to load at all. I guess that is the best I could do for what I wanted.
In the example below, the get_data function obtains the data/bytes of an image. I put that into a column of the dataframe that Altair takes as input.
def plot_(images_from):
    import altair as alt
    import pandas as pd
    import numpy as np

    np.random.seed(0)
    n_objects = 20
    n_times = 50

    # Create one (x, y) pair of metadata per object
    locations = pd.DataFrame({
        'id': range(n_objects),
        'x': np.random.randn(n_objects),
        'y': np.random.randn(n_objects)
    })

    def get_data(p):
        import base64
        with open(p, "rb") as f:
            return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

    import urllib.request
    if images_from == 'url':
        l1 = [f"https://vega.github.io/vega-datasets/data/{k}.png" for k in ['ffox', '7zip', 'gimp']]
    elif images_from == 'data':
        l1 = [get_data(urllib.request.urlretrieve(f"https://vega.github.io/vega-datasets/data/{k}.png", f'/tmp/{k}.png')[0])
              for k in ['ffox', '7zip', 'gimp']]
    np.random.seed(0)
    locations['img'] = np.random.choice(l1, size=len(locations))

    # Create a 50-element time-series for each object
    timeseries = pd.DataFrame(np.random.randn(n_times, n_objects).cumsum(0),
                              columns=locations['id'],
                              index=pd.RangeIndex(0, n_times, name='time'))

    # Melt the wide-form timeseries into a long-form view
    timeseries = timeseries.reset_index().melt('time')

    # Merge the (x, y) metadata into the long-form view
    timeseries['id'] = timeseries['id'].astype(int)  # make merge not complain
    data = pd.merge(timeseries, locations, on='id')

    # Data is prepared, now make a chart
    selector = alt.selection_single(empty='none', fields=['id'])
    base = alt.Chart(data).properties(
        width=250,
        height=250
    ).add_selection(selector)
    points = base.mark_point(filled=True, size=200).encode(
        x='mean(x)',
        y='mean(y)',
        color=alt.condition(selector, 'id:O', alt.value('lightgray'), legend=None),
    )
    timeseries = base.mark_line().encode(
        x='time',
        y=alt.Y('value', scale=alt.Scale(domain=(-15, 15))),
        color=alt.Color('id:O', legend=None)
    ).transform_filter(
        selector
    )
    images = base.mark_image(filled=True, size=200).encode(
        x='x',
        y='y',
        url='img',
    ).transform_filter(
        selector
    )
    chart = points | timeseries | images
    chart.save(f'test/chart_images_{images_from}.html')
# generate htmls
plot_(images_from='url') # generate the HTML using URLs
plot_(images_from='data') # generate the HTML using data/bytes
The HTML made using the data was ~78 times bigger than the one made using URLs (~12 MB vs ~0.16 MB), but not noticeably slower.
Update: As I later found out, Google Sites does not allow embedding HTML of more than 1 MB in size. So in the end, encoding the images did not really help.

Related

Analyzing user driving pattern data

I have a lot of GPX files of user driving data from a game project in which objects are placed on the road and the user collects them. I want to analyze these data to find out how users tend to drive given different objects: which ones draw them the most and which ones the least. I have not done any data analysis before, so how can I analyze these data to get this sort of information? This might sound very novice, but any help is appreciated.
If you are a novice, you would probably like to do this in Python, and then you can use a library like gpxpy to explore your data.
That is a GPX parser, and I believe it will provide you with the data you would like to see.
As shown in its documentation, you can use it like this:
import gpxpy
import gpxpy.gpx

# Open a file
gpx_file = open('yourfile.gpx', 'r')

# Parse the file
gpx = gpxpy.parse(gpx_file)

# Iterate over the tracks
for track in gpx.tracks:
    for segment in track.segments:
        for pt in segment.points:
            print(f'Point at ({pt.latitude},{pt.longitude}) -> {pt.elevation}')

for waypoint in gpx.waypoints:
    print(f'waypoint {waypoint.name} -> ({waypoint.latitude},{waypoint.longitude})')

for route in gpx.routes:
    print('Route:')
    for pt in route.points:
        print(f'Point at ({pt.latitude},{pt.longitude}) -> {pt.elevation}')
Once you have those points you can calculate the distances, speeds, etc. from the coordinates.
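For example, here is a minimal sketch of how you might turn those points into a total distance and an average speed, using a plain haversine formula (the helper name haversine_m is just for illustration, and gpx.get_duration() only returns a value if the file contains timestamps):
import math

def haversine_m(lat1, lon1, lat2, lon2):
    # Great-circle distance in metres between two (lat, lon) points
    r = 6371000  # Earth radius in metres
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

total_m = 0.0
for track in gpx.tracks:
    for segment in track.segments:
        pts = segment.points
        for prev, cur in zip(pts, pts[1:]):
            total_m += haversine_m(prev.latitude, prev.longitude, cur.latitude, cur.longitude)

duration_s = gpx.get_duration()  # seconds, if the GPX has timestamps
if duration_s:
    print(f'distance: {total_m / 1000:.2f} km, avg speed: {total_m / duration_s * 3.6:.1f} km/h')
From there you could group such numbers by the object a user was heading towards, to compare how the different objects affect driving behaviour.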

How can I process large files in Code Repositories?

I have a data feed that gives a large .txt file (50-75GB) every day. The file contains several different schemas within it, where each row corresponds to one schema. I would like to split this into partitioned datasets for each schema. How can I do this efficiently?
The largest problem you need to solve is the iteration speed to recover your schemas, which can be challenging for a file at this scale.
Your best tactic here will be to get an example 'notional' file with each of the schemas you want to recover as a line within it, and to add this as a file within your repository. When you add this file into your repo (alongside your transformation logic), you will then be able to push it into a dataframe, much as you would with the raw files in your dataset, for quick testing iteration.
First, make sure you specify .txt files as part of your package contents so that your tests will discover them (this is covered in the documentation under Read a file from a Python repository):
You can read other files from your repository into the transform context. This might be useful in setting parameters for your transform code to reference.
To start, edit setup.py in your Python repository:
setup(
    name=os.environ['PKG_NAME'],
    # ...
    package_data={
        '': ['*.txt']
    }
)
I am using a txt file with the following contents:
my_column, my_other_column
some_string,some_other_string
some_thing,some_other_thing,some_final_thing
This text file is at the following path in my repository: transforms-python/src/myproject/datasets/raw.txt
Once you have configured the text file to be shipped with your logic, and after you have included the file itself in your repository, you can then include the following code. This code has a couple of important functions:
It keeps raw file parsing logic completely separate from the stage of reading the file into a Spark DataFrame. This is so that the way this DataFrame is constructed can be left to the test infrastructure, or to the run time, depending on where you are running.
Keeping this logic separate lets you ensure that the actual row-by-row parsing you want to do is its own testable function, instead of living purely within your my_compute_function.
This code uses the Spark-native spark_session.read.text method, which will be orders of magnitude faster than row-by-row parsing of a raw txt file. This will ensure the parallelized DataFrame is what you operate on, not a single file, line by line, inside your executors (or worse, your driver).
from transforms.api import transform, Input, Output
from pkg_resources import resource_filename


def raw_parsing_logic(raw_df):
    return raw_df


@transform(
    my_output=Output("/txt_tests/parsed_files"),
    my_input=Input("/txt_tests/dataset_of_files"),
)
def my_compute_function(my_input, my_output, ctx):
    all_files_df = None
    for file_status in my_input.filesystem().ls('**/**'):
        raw_df = ctx.spark_session.read.text(my_input.filesystem().hadoop_path + "/" + file_status.path)
        parsed_df = raw_parsing_logic(raw_df)
        all_files_df = parsed_df if all_files_df is None else all_files_df.unionByName(parsed_df)
    my_output.write_dataframe(all_files_df)


def test_my_compute_function(spark_session):
    file_path = resource_filename(__name__, "raw.txt")
    raw_df = raw_parsing_logic(
        spark_session.read.text(file_path)
    )
    assert raw_df.count() > 0
    raw_columns_set = set(raw_df.columns)
    expected_columns_set = {"value"}
    assert len(raw_columns_set.intersection(expected_columns_set)) == 1
Once you have this code up and running, your test_my_compute_function method will be very fast to iterate on, so that you can perfect your schema recovery logic. This will make it substantially easier to get your dataset building at the very end, but without any of the overhead of a full build.
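If it helps as a starting point for that schema recovery logic, here is a hedged sketch of what raw_parsing_logic could look like, assuming (purely for illustration) that the schemas can be told apart by their number of comma-separated fields; the fields and schema_id columns are hypothetical names:
from pyspark.sql import functions as F

def raw_parsing_logic(raw_df):
    # Split each raw line on commas and tag it with a schema id derived from
    # the field count; replace this rule with whatever actually distinguishes
    # your schemas (a record-type prefix, a fixed column, etc.).
    return (
        raw_df
        .withColumn("fields", F.split(F.col("value"), ","))
        .withColumn("schema_id", F.size(F.col("fields")))
    )
Downstream, my_compute_function (or one transform per schema) can then filter on schema_id and write each group out to its own partitioned dataset.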

How to add w:altChunk and its relationship with python-docx

I have a use case that makes use of the <w:altChunk/> element in a Word document by injecting (fragments of) HTML files as alternate chunks and letting Word do its work when the file gets opened. The current implementation used XML/XSL to compose the WordML XML, modify the relationships, and do all the packaging manually, which is a real pain.
I wanted to move to python-docx, but the API doesn't support this directly. So far I have found a way to add the <w:altChunk/> element in the document XML, but I am still struggling to find a way to add the relationship and the related file to the package.
I think I should make a compatible part and pass it to the document.part.relate_to function to do its job, but I still can't figure out how:
from docx import Document
from docx.oxml import OxmlElement, qn
from docx.opc.constants import RELATIONSHIP_TYPE as RT

def add_alt_chunk(doc: Document, chunk_part):
    ''' TODO: figuring how to add files and relationships'''
    r_id = doc.part.relate_to(chunk_part, RT.A_F_CHUNK)
    alt = OxmlElement('w:altChunk')
    alt.set(qn('r:id'), r_id)
    doc.element.body.sectPr.addprevious(alt)
Update:
As per scanny's advice, below is my working code. Thank you very much Steve!
from docx import Document
from docx.oxml import OxmlElement
from docx.oxml.ns import qn
from docx.opc.part import Part
from docx.opc.constants import RELATIONSHIP_TYPE as RT

def add_alt_chunk(doc: Document, html: str):
    package = doc.part.package
    partname = package.next_partname('/word/altChunk%d.html')
    alt_part = Part(partname, 'text/html', html.encode(), package)
    r_id = doc.part.relate_to(alt_part, RT.A_F_CHUNK)
    alt_chunk = OxmlElement('w:altChunk')
    alt_chunk.set(qn('r:id'), r_id)
    doc.element.body.sectPr.addprevious(alt_chunk)

doc = Document()
doc.add_paragraph('Hello')
add_alt_chunk(doc, "<body><strong>I'm an altChunk</strong></body>")
doc.add_paragraph('Have a nice day!')
doc.save('test.docx')
Note: the altChunk parts only work/appear when the document is opened using MS Word.
Well, some hints here anyway. Maybe you can post your working code at the end as a full "answer":
The alt-chunk part needs to start its life as a docx.opc.part.Part object.
The blob argument should be the bytes of the file, which is often but not always plain text. It must be bytes though, not unicode (characters), so any encoding has to happen before calling Part().
I expect you can work out the other arguments:
package is the overall OPC package, available on document.part.package.
You can use docx.opc.package.OpcPackage.next_partname() to get an available partname based on a root template like: "altChunk%s" for a name like "altChunk3". Check what partname prefix Word uses for these, possibly with unzip -l has-an-alt-chunk.docx; should be easy to spot.
The content-type is one in docx.opc.constants.CONTENT_TYPE. Check the [Content_Types].xml part in a .docx file that has an altChunk to see what they use.
Once formed, the document_part.relate_to() method will create the proper relationship. If there is more than one relationship (not common) then you need to create each one separately. There would only be one relationship from a particular part, just some parts are related to more than one other part. Check the relationships in an existing .docx to see, but pretty good guess it's only the one in this case.
So your code would look something like:
package = document.part.package
partname = package.next_partname("altChunkySomethingPrefix")
content_type = docx.opc.constants.CONTENT_TYPE.THE_RIGHT_MIME_TYPE
blob = make_the_altChunk_file_bytes()
alt_chunk_part = Part(partname, content_type, blob, package)
rId = document.part.relate_to(alt_chunk_part, RT.A_F_CHUNK)
etc.

How to load image from csv file in tensorflow

I have images saved in 0.csv files.
The format is as in the picture below.
How can I read them into TensorFlow?
Thanks!
You should use the Dataset input pipeline introduced in TensorFlow 1.4:
https://www.tensorflow.org/programmers_guide/datasets#consuming_text_data
Here's the example from the developers' guide (though you'll want to read through that guide; it's quite well written):
filenames = ["/var/data/file1.txt", "/var/data/file2.txt"]
dataset = tf.data.Dataset.from_tensor_slices(filenames)

# Use `Dataset.flat_map()` to transform each file as a separate nested dataset,
# and then concatenate their contents sequentially into a single "flat" dataset.
# * Skip the first line (header row).
# * Filter out lines beginning with "#" (comments).
dataset = dataset.flat_map(
    lambda filename: (
        tf.data.TextLineDataset(filename)
        .skip(1)
        .filter(lambda line: tf.not_equal(tf.substr(line, 0, 1), "#"))))
The Dataset preprocessing pipeline has a few nice advantages. Most of the functionality you'll need such as reading text records, shuffling, batching, etc. are reduced to one-liners. More importantly though, it forces you into writing your preprocessing pipeline in a good, modular, testable way. It takes a little bit to get used to the API, but it's time well spent.
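For the CSV-of-pixels case specifically, here is a rough sketch of how the parsing step might look with the current tf.data string ops, assuming (since the picture of the format is not shown) that each row is a label followed by the flattened pixel values; IMG_HEIGHT, IMG_WIDTH and parse_line are illustrative names you would adapt to your data:
import tensorflow as tf

IMG_HEIGHT, IMG_WIDTH = 28, 28  # assumed image shape; change to match your data

def parse_line(line):
    # Assumes each row looks like: label,pixel_0,pixel_1,...,pixel_N
    fields = tf.strings.to_number(tf.strings.split(line, ','), out_type=tf.float32)
    label = tf.cast(fields[0], tf.int32)
    image = tf.reshape(fields[1:], [IMG_HEIGHT, IMG_WIDTH, 1])
    return image, label

dataset = (tf.data.TextLineDataset(["0.csv"])  # add .skip(1) here if there is a header row
           .map(parse_line)
           .batch(32))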

where is the word/_rels/document.xml.rels in python docx object?

I need the content of word/_rels/document.xml.rels to get the image information. Does python-docx store it?
I use this:
>>> from docx import Document as d
>>> x=d('a.docx')
There seems to be no way to get it from the x object.
python-docx and python-pptx share a common opc subpackage; this is the docx.opc subpackage.
This layer abstracts the details of the .rels files, among other things.
You can get to it using:
>>> document = Document()
>>> document_part = document.part
>>> rels = document_part.rels
>>> for rId in rels:
...     print(rId)
'rId2'
'rId1'
'rId3'
How you use it most effectively depends on what you're trying to get at. Usually one just wants to get a related part and doesn't care about navigating the details of the packaging. For that there are these higher level methods:
docx.opc.part.Part.part_related_by()
docx.opc.part.Part.related_parts[rId]
In general the route from the object at hand is:
to the part it's contained in (often available on obj.part)
to the related part by use of .part_related_by() (using relationship type) or .related_parts[rId] (it's a dict).
back down to the API object via X_Part.main_obj, e.g. DocumentPart.document
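For the image information asked about above, a short sketch of that route might look like the following; it relies on the docx.opc relationship attributes reltype, target_part, partname, content_type and blob, so treat it as illustrative rather than definitive:
from docx import Document
from docx.opc.constants import RELATIONSHIP_TYPE as RT

document = Document('a.docx')

# Walk the relationships recorded in word/_rels/document.xml.rels and
# collect the image parts they point to.
for rel in document.part.rels.values():
    if rel.reltype == RT.IMAGE:
        image_part = rel.target_part
        print(image_part.partname, image_part.content_type, len(image_part.blob), 'bytes')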
The areas in the code you might be interested in looking closer at are:
docx/parts/
docx/opc/part.py