I have a use case that makes use of the <w:altChunk/> element in a Word document by injecting (fragments of) HTML files as alternate chunks and letting Word do the work when the file gets opened. The current implementation used XML/XSL to compose the WordML XML, modify relationships, and do all the packaging manually, which is a real pain.
I wanted to move to python-docx, but the API doesn't support this directly. I found a way to add the <w:altChunk/> element to the document XML, but I'm still struggling to find a way to add the relationship and the related file to the package.
I think I should create a compatible part and pass it to the document.part.relate_to function to do its job, but I still can't figure out how:
from docx import Document
from docx.oxml import OxmlElement, qn
from docx.opc.constants import RELATIONSHIP_TYPE as RT
def add_alt_chunk(doc: Document, chunk_part):
    ''' TODO: figuring how to add files and relationships'''
    r_id = doc.part.relate_to(chunk_part, RT.A_F_CHUNK)
    alt = OxmlElement('w:altChunk')
    alt.set(qn('r:id'), r_id)
    doc.element.body.sectPr.addprevious(alt)
Update:
As per scanny's advice, below is my working code. Thank you very much Steve!
from docx import Document
from docx.oxml import OxmlElement
from docx.oxml.ns import qn
from docx.opc.part import Part
from docx.opc.constants import RELATIONSHIP_TYPE as RT
def add_alt_chunk(doc: Document, html: str):
    package = doc.part.package
    partname = package.next_partname('/word/altChunk%d.html')
    alt_part = Part(partname, 'text/html', html.encode(), package)
    r_id = doc.part.relate_to(alt_part, RT.A_F_CHUNK)
    alt_chunk = OxmlElement('w:altChunk')
    alt_chunk.set(qn('r:id'), r_id)
    doc.element.body.sectPr.addprevious(alt_chunk)
doc = Document()
doc.add_paragraph('Hello')
add_alt_chunk(doc, "<body><strong>I'm an altChunk</strong></body>")
doc.add_paragraph('Have a nice day!')
doc.save('test.docx')
Note: the altChunk parts only work/appear when the document is opened in MS Word.
Well, some hints here anyway. Maybe you can post your working code at the end as a full "answer":
The alt-chunk part needs to start its life as a docx.opc.part.Part object.
The blob argument should be the bytes of the file, which is often but not always plain text. It must be bytes though, not unicode (characters), so any encoding has to happen before calling Part().
I expect you can work out the other arguments:
package is the overall OPC package, available on document.part.package.
You can use docx.opc.package.OpcPackage.next_partname() to get an available partname based on a root template like: "altChunk%s" for a name like "altChunk3". Check what partname prefix Word uses for these, possibly with unzip -l has-an-alt-chunk.docx; should be easy to spot.
The content-type is one in docx.opc.constants.CONTENT_TYPE. Check the [Content_Types].xml part in a .docx file that has an altChunk to see what they use.
Once formed, the document_part.relate_to() method will create the proper relationship. If there is more than one relationship (not common), you need to create each one separately. There would only be one relationship from a particular part; it's just that some parts are related to more than one other part. Check the relationships in an existing .docx to see, but it's a pretty good guess that it's only the one in this case.
So your code would look something like:
package = document.part.package
partname = package.next_partname("altChunkySomethingPrefix")
content_type = docx.opc.constants.CONTENT_TYPE.THE_RIGHT_MIME_TYPE
blob = make_the_altChunk_file_bytes()
alt_chunk_part = Part(partname, content_type, blob, package)
rId = document.part.relate_to(alt_chunk_part, RT.A_F_CHUNK)
etc.
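For the "check an existing .docx" steps, here's a quick sketch using only the standard-library zipfile module (the filename is a placeholder for any document that already contains an altChunk); it lists the part names, so you can spot the prefix Word uses, and prints the registered content types:
import zipfile

# Placeholder filename: any .docx that already contains an altChunk.
with zipfile.ZipFile("has-an-alt-chunk.docx") as docx_zip:
    # Equivalent of `unzip -l`: look for the altChunk part to see its partname prefix.
    for name in docx_zip.namelist():
        print(name)
    # The content type registered for that part is listed here.
    print(docx_zip.read("[Content_Types].xml").decode("utf-8"))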
I am trying to embed an interactive plot made using Altair into a Google Site. In this plot, I want to interactively display one image at a time, where the images are stored on Google Drive. When I gave this an attempt, mark_image failed silently, presumably because it did not read the image. This is not a surprise, because the Google Drive images were private. With publicly shared images I won't have this issue, but for the purpose of this plot I would like to keep the images private. Plus, there are a lot of images in total (~1K), so I probably should not encode them as data/bytes; I suspect that would make my HTML file very big and slow. Please correct me if I am wrong on this.
I wonder if mark_image could read the images from the Google Drive links, probably using a "reader" of some sort (an upstream JS or Python library), and then feed the read image to mark_image. If anybody has experience with this, solutions/suggestions/workarounds would be greatly appreciated.
Here's some demo code to test this:
Case 1: Publicly accessible image (no problem). Displayed using mark_image, saved in HTML format.
import altair as alt
import pandas as pd
path="https://vega.github.io/vega-datasets/data/gimp.png"
source = pd.DataFrame([{"x": 0, "y": 0, "img": path},])
chart = alt.Chart(source).mark_image(width=100, height=100).encode(x='x', y='y', url='img')
chart.save('test.html')
Then I embed the HTML in a google-site (private, not shared to the public), using this option and then paste the content of the HTML file in the Embed code tab.
Case 2: Image on Google-drive (problem!). The case of an image stored on google-drive (private, not shared to the public).
# Please use the code above with `path` variable generated like this:
file_id='' # google drive file id
path=f"https://drive.google.com/uc?export=view&id={file_id}"
In this case, apparently mark_image fails silently and the image is not shown on the plot.
After searching for an optimal solution, I decided to rely on a workaround: encoding the images as data/bytes. This eliminates the issue of reading the URLs from Google Drive, which I could not find a solution for.
Encoding the images as data/bytes, as I suspected, made the HTML big, but surprisingly (to me) not slow to load at all. I guess that's the best I could do for what I wanted.
In the example below, the get_data function obtains the data/bytes of an image. I put that into a column of the dataframe that Altair takes as input.
def plot_(images_from):
    import altair as alt
    import pandas as pd
    import numpy as np
    np.random.seed(0)
    n_objects = 20
    n_times = 50
    # Create one (x, y) pair of metadata per object
    locations = pd.DataFrame({
        'id': range(n_objects),
        'x': np.random.randn(n_objects),
        'y': np.random.randn(n_objects)
    })
    def get_data(p):
        import base64
        with open(p, "rb") as f:
            return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()
    import urllib.request
    if images_from == 'url':
        l1 = [f"https://vega.github.io/vega-datasets/data/{k}.png" for k in ['ffox', '7zip', 'gimp']]
    elif images_from == 'data':
        l1 = [get_data(urllib.request.urlretrieve(f"https://vega.github.io/vega-datasets/data/{k}.png", f'/tmp/{k}.png')[0]) for k in ['ffox', '7zip', 'gimp']]
    np.random.seed(0)
    locations['img'] = np.random.choice(l1, size=len(locations))
    # Create a 50-element time-series for each object
    timeseries = pd.DataFrame(np.random.randn(n_times, n_objects).cumsum(0),
                              columns=locations['id'],
                              index=pd.RangeIndex(0, n_times, name='time'))
    # Melt the wide-form timeseries into a long-form view
    timeseries = timeseries.reset_index().melt('time')
    # Merge the (x, y) metadata into the long-form view
    timeseries['id'] = timeseries['id'].astype(int)  # make merge not complain
    data = pd.merge(timeseries, locations, on='id')
    # Data is prepared, now make a chart
    selector = alt.selection_single(empty='none', fields=['id'])
    base = alt.Chart(data).properties(
        width=250,
        height=250
    ).add_selection(selector)
    points = base.mark_point(filled=True, size=200).encode(
        x='mean(x)',
        y='mean(y)',
        color=alt.condition(selector, 'id:O', alt.value('lightgray'), legend=None),
    )
    timeseries = base.mark_line().encode(
        x='time',
        y=alt.Y('value', scale=alt.Scale(domain=(-15, 15))),
        color=alt.Color('id:O', legend=None)
    ).transform_filter(
        selector
    )
    images = base.mark_image(filled=True, size=200).encode(
        x='x',
        y='y',
        url='img',
    ).transform_filter(
        selector
    )
    chart = points | timeseries | images
    chart.save(f'test/chart_images_{images_from}.html')
# generate htmls
plot_(images_from='url') # generate the HTML using URLs
plot_(images_from='data') # generate the HTML using data/bytes
The HTML made using the data was ~78 times bigger than the one made using URLs (~12 MB vs ~0.16 MB), but not noticeably slower.
Update: As I later found out, Google Sites does not allow embedding an HTML file larger than 1 MB, so in the end, encoding the images did not really help.
When I run my PsychoPy experiment, PsychoPy saves a CSV file that contains my trials and the values of my variables.
Among these, there are some variables I would like NOT to be included. There are some variables I deliberately chose to include in the CSV, but many others end up in it automatically.
Is there a way to manually force (from the code block) the exclusion of some variables from the CSV?
Is there a way to decide the order of the saved columns/variables in the CSV?
It is not really important, and I know I could just create my own output file without using the one from PsychoPy, or easily clean it afterwards, but I was just curious.
PsychoPy spits out all the variables it thinks you could need. If you want to drop some of them, that is a task for the analysis stage, and is easily done in any processing pipeline. Unless you are analysing data in a spreadsheet (which you really shouldn't), the number of columns in the output file shouldn't really be an issue. The philosophy is that you shouldn't back yourself into a corner by discarding data at the recording stage - what about the reviewer who asks about the influence of a variable that you didn't think was important?
If you are using the Builder interface, the saving of onset & offset times for each component is optional, and is controlled in the "data" tab of each component dialog.
The order of variables is also not under direct control of the user, but again, can be easily manipulated at the analysis stage.
As you note, you can of course write code to save custom output files of your own design.
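For example, here is a minimal sketch of such a custom output file, written from a code component at the end of the experiment; the my_trials list and the column names are hypothetical placeholders for whatever you actually collect:
import csv

# Hypothetical rows, e.g. appended to this list in a code component after each trial.
my_trials = [
    {'participant': 'p01', 'condition': 'A', 'rt': 0.532, 'correct': 1},
    {'participant': 'p01', 'condition': 'B', 'rt': 0.611, 'correct': 0},
]

# Only the columns you care about, in the order you want them.
columns = ['participant', 'condition', 'rt', 'correct']

with open('custom_output.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=columns, extrasaction='ignore')
    writer.writeheader()
    writer.writerows(my_trials)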
There is a special block called session_variable_order: [var1, var2, var3] in the experiment_config.yaml file, which you probably should be using; also, consider these methods:
from psychopy import data
data.ExperimentHandler.saveAsWideText(fileName = 'exp_handler.csv', delim='\t', sortColumns = False, encoding = 'utf-8')
data.TrialHandler.saveAsText(fileName = 'trial_handler.txt', delim=',', encoding = 'utf-8', dataOut = ('n', 'all_mean', 'all_raw'), summarised = False)
notice the sortColumns and dataOut params
TLDR
I am looking for an existing convention to encode / serialize tree-like data in a directory structure, split into small files instead of one big file.
Background
There are different scenarios where we want to store tree-like data in a file, which can then be tracked in git. Json files can express dependencies for a package manager (e.g. composer for php, npm for node.js). Yml files can define routes, test cases, etc.
Typically a "tree structure" is a combination of key-value lists and "serial" lists, where each value can again be a tree structure.
Very often the order of associative keys is irrelevant, and should ideally be normalized to alphabetic order.
One problem when storing a big tree structure in a single file, be it json or yml, which is then tracked with git, is that you get plenty of merge conflicts if different branches add and remove entries in the same key-value list.
Especially for key-value lists where the order is irrelevant, it would be more git-friendly to store each sub-tree in a separate file or directory, instead of storing them all in one big file.
Technically it should be possible to create a directory structure that is as expressive as json or yml.
Performance concerns can be overcome with caching. If the files are going to be tracked in git, we can assume they are going to be unchanged most of the time.
The main challenges:
- How to deal with "special characters" that cause problems in some or most file systems, if used in a file or directory name?
- If I need to encode or disambiguate special characters, how can I still keep it pleasant to the eye?
- How to deal with limitations to file name length in some file systems?
- How to deal with other file system quirks, e.g. case insensitivity? Is this even still a thing?
- How to express serial lists, which might contain key-value lists as children? Serial lists cannot be expressed as directories, so their children have to live within the same file.
- How can I avoid reinventing the wheel, creating my own made-up "convention" that nobody else uses?
Desired features:
- As expressive as json or yml.
- git-friendly.
- Machine-readable and -writable.
- Human-readable and -editable, perhaps with limitations.
- Ideally it should use known formats (json, yml) for structures and values that are expressed within a single file.
Naive approach
Of course the first idea would be to use yml files for literal values and serial lists, and directories for key-value lists (in cases where the order does not matter). In a key-value list, the file or directory names are interpreted as keys, the files and subdirectories as values.
This has some limitations, because not every possible key that would be valid in json or yml is also a valid file name in every file system. The most obvious example would be a slash.
Question
I have different ideas how I would do this myself.
But I am really looking for some kind of convention for this that already exists.
Related questions
Persistence: Data Trees stored as Directory Trees
This is asking about performance, and about using the filesystem like a database - I think.
I am less interested in performance (caching makes it irrelevant), and more about the actual storage format / convention.
The closest thing I can think of that could be seen as some kind of convention for doing this is Linux configuration files. In modern Linux, you often split the configuration of a service into multiple files residing in a certain directory, e.g. /etc/exim4/conf.d/ instead of having a single file /etc/exim/exim4.conf. There are multiple reasons for doing this:
Some configuration may be provided by the package manager (e.g. linking to other services that are installed via package manager), while other parts are user-defined. Since there would be a conflict if the user edits a file provided by the package manager, they can instead create a new file and enter additional configuration there.
For large configuration files (like for exim4), it's easier to navigate the configuration if you have multiple files for different concerns (hardcore vim users might disagree).
You can enable / disable parts of the configuration more easily by renaming / moving the file that contains a certain part.
We can learn a bit from this: separation into distinct files should happen if the semantics of the content are orthogonal, i.e. the semantics of one file do not depend on the semantics of another file. This is of course a rule for sibling files; we cannot really deduce rules for serializing a tree structure as a directory tree from it. However, we can definitely see reasons for not splitting every value into its own file.
You mention problems of encoding special characters into a file name. You will only have this problem if you go against conventions! The implicit convention on file and directory names is that they act as locator / ID for files, never as content. Again, we can learn a bit from Linux config files: Usually, there is a master file that contains an include statement which loads all the split files. The include statement gives a path glob expression which locates the other files. The path to those files is irrelevant for the semantics of their content. Technically, we can do something similar with YAML.
Assume we want to split this single YAML file into multiple files (pardon my lack of creativity):
spam:
  spam: spam
  egg: sausage
baked beans:
  - spam
  - spam
  - bacon
A possible transformation would be this (read stuff ending with / as directory, : starts file content):
confdir/
  main.yaml:
    spam: !include spammap/main.yaml
    baked beans: !include beans/
  spammap/
    main.yaml:
      spam: !include spam.yaml
      egg: !include egg.yaml
    spam.yaml:
      spam
    egg.yaml:
      sausage
  beans/
    1.yaml:
      spam
    2.yaml:
      spam
    3.yaml:
      bacon
(In YAML, !include is a local tag. With most implementations, you can register a custom constructor for it, thus loading the whole hierarchy as single document.)
As you can see, I put every hierarchy level and every value into a separate file. I use two kinds of includes: A reference to a file will load the content of that file; a reference to a directory will generate a sequence where each item's value is the content of one file in that directory, sorted by file name. As you can see, the file and directory names are never part of the content, sometimes I opted to name them differently (e.g. baked beans -> beans/) to avoid possible file system problems (spaces in filenames in this case – usually not a serious problem nowadays). Also, I adhere to the filename extension convention (having the files carry .yaml). This would be more quirky if you put content into the file names.
I named the starting file on each level main.yaml (not needed in beans/ since it's a sequence). While the exact name is arbitrary, this is a convention used in several other tools, e.g. Python with __init__.py or the Nix package manager with default.nix. Then I placed additional files or directories besides this main file.
Since including other files is explicit, it is not a problem with this approach to put a larger part of the content into a single file. Note that JSON lacks YAML's tags functionality, but you can still walk through a loaded JSON file and preprocess values like {"!include": "path"}.
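A rough sketch of that JSON variant (the {"!include": ...} marker is just the convention suggested here, not an existing standard, and confdir/main.json is a made-up entry point):
import json
import os

def load_tree(path):
    """Load a JSON file and recursively resolve {"!include": "relative/path"} values."""
    basedir = os.path.dirname(path)
    with open(path) as f:
        return _resolve(json.load(f), basedir)

def _resolve(node, basedir):
    if isinstance(node, dict):
        if set(node) == {"!include"}:
            # Replace the marker object with the content of the referenced file.
            return load_tree(os.path.join(basedir, node["!include"]))
        return {key: _resolve(value, basedir) for key, value in node.items()}
    if isinstance(node, list):
        return [_resolve(item, basedir) for item in node]
    return node

tree = load_tree("confdir/main.json")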
To sum up: while there is no existing convention that does exactly what you want, parts of the problem have been solved in different places, and you can inherit wisdom from that.
Here's a minimal working example of how to do it with PyYAML. This is just a proof of concept; several features are missing (e.g. autogenerated file names will be ascending numbers, and there is no support for serializing lists into directories). It shows what needs to be done to store information about the data layout while staying transparent to the user (data can be accessed like a normal dict structure). It remembers which files values have been loaded from and stores them to those files again.
import os.path
from pathlib import Path
import yaml
from yaml.reader import Reader
from yaml.scanner import Scanner
from yaml.parser import Parser
from yaml.composer import Composer
from yaml.constructor import SafeConstructor
from yaml.resolver import Resolver
from yaml.emitter import Emitter
from yaml.serializer import Serializer
from yaml.representer import SafeRepresenter
class SplitValue(object):
    """This is a value that should be written into its own YAML file."""

    def __init__(self, content, path=None):
        self._content = content
        self._path = path

    def getval(self):
        return self._content

    def setval(self, value):
        self._content = value

    def __repr__(self):
        return self._content.__repr__()

class TransparentContainer(object):
    """Makes SplitValues transparent to the user."""

    def __getitem__(self, key):
        val = super(TransparentContainer, self).__getitem__(key)
        return val.getval() if isinstance(val, SplitValue) else val

    def __setitem__(self, key, value):
        val = super(TransparentContainer, self).__getitem__(key)
        if isinstance(val, SplitValue) and not isinstance(value, SplitValue):
            val.setval(value)
        else:
            super(TransparentContainer, self).__setitem__(key, value)

class TransparentList(TransparentContainer, list):
    pass

class TransparentDict(TransparentContainer, dict):
    pass
class DirectoryAwareFileProcessor(object):
    def __init__(self, path, mode):
        self._basedir = os.path.dirname(path)
        self._file = open(path, mode)

    def close(self):
        try:
            self._file.close()
        finally:
            self.dispose()  # implemented by PyYAML

    # __enter__ / __exit__ to use this in a `with` construct
    def __enter__(self):
        return self

    def __exit__(self, type, value, traceback):
        self.close()

class FilesystemLoader(DirectoryAwareFileProcessor, Reader, Scanner,
                       Parser, Composer, SafeConstructor, Resolver):
    """Loads YAML file from a directory structure."""

    def __init__(self, path):
        DirectoryAwareFileProcessor.__init__(self, path, 'r')
        Reader.__init__(self, self._file)
        Scanner.__init__(self)
        Parser.__init__(self)
        Composer.__init__(self)
        SafeConstructor.__init__(self)
        Resolver.__init__(self)
def split_value_constructor(loader, node):
    path = loader.construct_scalar(node)
    with FilesystemLoader(os.path.join(loader._basedir, path)) as childLoader:
        return SplitValue(childLoader.get_single_data(), path)

FilesystemLoader.add_constructor(u'!include', split_value_constructor)

def transp_dict_constructor(loader, node):
    ret = TransparentDict()
    ret.update(loader.construct_mapping(node, deep=True))
    return ret

# override constructor for !!map, the default resolved tag for mappings
FilesystemLoader.add_constructor(yaml.resolver.BaseResolver.DEFAULT_MAPPING_TAG,
                                 transp_dict_constructor)

def transp_list_constructor(loader, node):
    ret = TransparentList()
    ret.extend(loader.construct_sequence(node, deep=True))
    return ret

# like above, for !!seq
FilesystemLoader.add_constructor(yaml.resolver.BaseResolver.DEFAULT_SEQUENCE_TAG,
                                 transp_list_constructor)
class FilesystemDumper(DirectoryAwareFileProcessor, Emitter,
                       Serializer, SafeRepresenter, Resolver):
    def __init__(self, path):
        DirectoryAwareFileProcessor.__init__(self, path, 'w')
        Emitter.__init__(self, self._file)
        Serializer.__init__(self)
        SafeRepresenter.__init__(self)
        Resolver.__init__(self)
        self.__next_unique_name = 1
        Serializer.open(self)

    def gen_unique_name(self):
        val = self.__next_unique_name
        self.__next_unique_name = self.__next_unique_name + 1
        return str(val)

    def close(self):
        try:
            Serializer.close(self)
        finally:
            DirectoryAwareFileProcessor.close(self)

def split_value_representer(dumper, data):
    if data._path is None:
        if isinstance(data._content, TransparentContainer):
            data._path = os.path.join(dumper.gen_unique_name(), "main.yaml")
        else:
            data._path = dumper.gen_unique_name() + ".yaml"
    Path(os.path.dirname(data._path)).mkdir(exist_ok=True)
    with FilesystemDumper(os.path.join(dumper._basedir, data._path)) as childDumper:
        childDumper.represent(data._content)
    return dumper.represent_scalar(u'!include', data._path)
yaml.add_representer(SplitValue, split_value_representer, FilesystemDumper)

def transp_dict_representer(dumper, data):
    return dumper.represent_dict(data)

yaml.add_representer(TransparentDict, transp_dict_representer, FilesystemDumper)

def transp_list_representer(dumper, data):
    return dumper.represent_list(data)

yaml.add_representer(TransparentList, transp_list_representer, FilesystemDumper)
# example usage:

# explicitly specify values that should be split.
myData = TransparentDict({
    "spam": SplitValue({
        "spam": SplitValue("spam", "spam.yaml"),
        "egg": SplitValue("sausage", "sausage.yaml")}, "spammap/main.yaml")})

with FilesystemDumper("root.yaml") as dumper:
    dumper.represent(myData)

# load values from stored files.
# The loaded data remembers which values have been in which files.
with FilesystemLoader("root.yaml") as loader:
    loaded = loader.get_single_data()

# modify a value as if it was a normal structure.
# actually updates a SplitValue
loaded["spam"]["spam"] = "baked beans"

# dumps the same structure as before, with the modified value.
with FilesystemDumper("root.yaml") as dumper:
    dumper.represent(loaded)
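For reference (assuming I traced my own example correctly), running this should leave the following files behind, with root.yaml containing only the top-level !include reference:
root.yaml
spammap/
  main.yaml
  spam.yaml
  sausage.yaml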
Internally, what are the differences between these two fields? What kind of schema do these fields map to in mongo? Also, how should documents with relations be added to these fields? For example, if I use
from mongoengine import *

class User(Document):
    name = StringField()

class Comment(EmbeddedDocument):
    text = StringField()
    tag = StringField()

class Post(Document):
    title = StringField()
    author = ReferenceField(User)
    comments = ListField(EmbeddedDocumentField(Comment))
and call
>>> some_author = User.objects.get(name="ExampleUserName")
>>> post = Post.objects.get(author=some_author)
>>> post.comments
[]
>>> comment = Comment(text="cool post", tag="django")
>>> comment.save()
>>>
should I use post.comments.append(comment) or post.comments += comment for appending this document? My original question stems from this confusion as to how I should handle this.
EmbeddedDocumentField is just part of the parent document, like DictField, and is stored in the same record as the parent document in mongo.
To save an EmbeddedDocument, just save the parent document.
>>> some_author = User.objects.get(name="ExampleUserName")
>>> post = Post.objects.get(author=some_author)
>>> post.comments
[]
>>> comment = Comment(text="cool post", tag="django")
>>> post.comments.append(comment)
>>> post.save()
>>> post.comments
[<Comment object __unicode__>]
>>> Post.objects.get(author=some_author).comments
[<Comment object __unicode__>]
See documentation: http://docs.mongoengine.org/guide/defining-documents.html#embedded-documents.
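To make the schema part concrete, here is a sketch that prints the raw stored form of a post; it assumes the model classes from the question, a local mongod, and a made-up database name:
from mongoengine import connect

connect('example_db')  # hypothetical database name

some_author = User(name="ExampleUserName").save()
post = Post(title="Some post", author=some_author,
            comments=[Comment(text="cool post", tag="django")]).save()

# to_mongo() shows the document roughly as it is stored: one record per Post,
# with the comments embedded inline and the author held only as a reference
# (an ObjectId by default, or a DBRef depending on the field's dbref setting).
print(post.to_mongo().to_dict())
# e.g. {'_id': ObjectId(...), 'title': 'Some post',
#       'author': ObjectId(...),
#       'comments': [{'text': 'cool post', 'tag': 'django'}]}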
This is just a sample case where we can use embedded docs.
Let's say, for example, you are going to create an app that takes in requirements as they come in and saves them in the db.
Now suppose you need to assign each requirement to a bunch of people, each at a later stage, after some processing of the requirement.
You also need to track the changes and log the activity pertaining to the processing that takes place with regard to the requirement.
I know, I know, you might say we can use an rdbms-style relationship with a reference field. But that involves taking care of deleting obsolete records in either collection, which is nothing but extra code to handle the maintenance of the child collection when the parent doc is deleted. (There are other extra efforts that come into play too.)
Instead, embedded documents are stored as part of the parent doc, so maintaining the parent involves the embedded docs too.
And it is easier to create complex JSON-structured data using embedded docs than to write your own logic to manipulate and assemble the data into a complex structure.
Here the relation is one requirement to many handlers (which is nothing but an activity log by the handlers for that one requirement).
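A sketch of that kind of model with mongoengine; all names here are made up for illustration, and it assumes a local mongod:
from datetime import datetime
from mongoengine import (Document, EmbeddedDocument, StringField,
                         DateTimeField, ListField, EmbeddedDocumentField, connect)

connect('example_db')  # hypothetical database name

class Activity(EmbeddedDocument):
    handler = StringField()
    note = StringField()
    logged_at = DateTimeField(default=datetime.utcnow)

class Requirement(Document):
    title = StringField()
    status = StringField(default='new')
    handlers = ListField(StringField())
    # The activity log lives inside the requirement document itself,
    # so deleting the requirement removes its log along with it.
    activity_log = ListField(EmbeddedDocumentField(Activity))

req = Requirement(title='New feature request', handlers=['alice', 'bob'])
req.activity_log.append(Activity(handler='alice', note='assigned for review'))
req.save()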
I am trying to scrape an HTML table and save its data in a database. What strategies/solutions have you found helpful in approaching this problem?
I'm most comfortable with Java and PHP but really a solution in any language would be helpful.
EDIT: For more detail, the UTA (Salt Lake's Bus system) provides bus schedules on its website. Each schedule appears in a table that has stations in the header and times of departure in the rows. I would like to go through the schedules and save the information in the table in a form that I can then query.
Here's the starting point for the schedules
It all depends on how well-formed the HTML you want to scrape is. If it's valid XHTML, you can simply use some XPath queries on it to get whatever you want.
Example of xpath in php: http://blogoscoped.com/archive/2004_06_23_index.html#108802750834787821
A helper class to scrape a table into an array: http://www.tgreer.com/class_http_php.html
There is a nice book about this topic: Spidering Hacks by Kevin Hemenway and Tara Calishain.
I've found that scripting languages are generally better suited for doing such tasks. I personally prefer Python, but PHP will work as well. Chopping, mincing and parsing strings in Java is just too much work.
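For instance, here is a minimal sketch in Python using the bs4 and sqlite3 packages; the URL and the station/time table layout are placeholders for whatever schedule page you end up scraping:
import sqlite3
import urllib.request
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Placeholder URL: substitute the actual schedule page.
html = urllib.request.urlopen("https://example.com/schedule.html").read()
soup = BeautifulSoup(html, "html.parser")

table = soup.find("table")
headers = [th.get_text(strip=True) for th in table.find_all("th")]
rows = [[td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in table.find_all("tr") if tr.find_all("td")]

# Store each station/time pair so it can be queried later.
conn = sqlite3.connect("schedules.db")
conn.execute("CREATE TABLE IF NOT EXISTS departures (station TEXT, time TEXT)")
for row in rows:
    for station, time in zip(headers, row):
        conn.execute("INSERT INTO departures (station, time) VALUES (?, ?)",
                     (station, time))
conn.commit()
conn.close()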
I have tried screen-scraping before, but I found it to be very brittle, especially with dynamically-generated code.
I found a third-party DOM-parser and used it to navigate the source code with Regex-like matching patterns in order to find the data I needed.
I suggest trying to find out if the owners of the site have a published API (often web services) for retrieving data from their system. If not, then good luck to you.
If what you want is the table in CSV form, then you can use this:
using python:
For example, imagine you want to scrape forex quotes in CSV form from a site like fxoanda.
then...
from BeautifulSoup import BeautifulSoup
import urllib,string,csv,sys,os
from string import replace
date_s = '&date1=01/01/08'
date_f = '&date=11/10/08'
fx_url = 'http://www.oanda.com/convert/fxhistory?date_fmt=us'
fx_url_end = '&lang=en&margin_fixed=0&format=CSV&redirected=1'
cur1,cur2 = 'USD','AUD'
fx_url = fx_url + date_f + date_s + '&exch=' + cur1 +'&exch2=' + cur1
fx_url = fx_url +'&expr=' + cur2 + '&expr2=' + cur2 + fx_url_end
data = urllib.urlopen(fx_url).read()
soup = BeautifulSoup(data)
data = str(soup.findAll('pre', limit=1))
data = replace(data,'[<pre>','')
data = replace(data,'</pre>]','')
file_location = '/Users/location_edit_this'
file_name = file_location + 'usd_aus.csv'
file = open(file_name,"w")
file.write(data)
file.close()
once you have it in this form you can convert the data to any form you like.
At the risk of starting a shitstorm here on SO, I'd suggest that if the format of the table never changes, you could just about get away with using regular expressions to parse and capture the content you need.
pianohacker overlooked the HTML::TableExtract module, which was designed for exactly this sort of thing. You'd still need LWP to retrieve the table.
This would be by far the easiest with Perl, and the following CPAN modules:
http://metacpan.org/pod/HTML::Parser
http://metacpan.org/pod/LWP
http://metacpan.org/pod/DBD/mysql
http://metacpan.org/pod/DBI.pm
CPAN being the main distribution mechanism for Perl modules, and accessible by running the following shell command, for example:
# cpan HTML::Parser
If you're on Windows, things will be more interesting, but you can still do it: http://www.perlmonks.org/?node_id=583586