I am trying to scrape an HTML table and save its data in a database. What strategies/solutions have you found helpful in approaching this problem?
I'm most comfortable with Java and PHP but really a solution in any language would be helpful.
EDIT: For more detail, the UTA (Salt Lake's bus system) provides bus schedules on its website. Each schedule appears in a table that has stations in the header and departure times in the rows. I would like to go through the schedules and save the information from each table in a form that I can then query.
Here's the starting point for the schedules
It all depends on how well-formed the HTML you want to scrape is. If it's valid XHTML, you can simply use XPath queries on it to get whatever you want.
Example of xpath in php: http://blogoscoped.com/archive/2004_06_23_index.html#108802750834787821
A helper class to scrape a table into an array: http://www.tgreer.com/class_http_php.html
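For instance, here is a minimal sketch of the XPath idea in Python using lxml (my choice of library; the links above show the PHP equivalent). The file name "schedule.html" is just a placeholder for a saved copy of the page:

from lxml import html

tree = html.parse("schedule.html")   # hypothetical saved copy of the schedule page
rows = tree.xpath("//table[1]//tr")  # every row of the first table
for row in rows:
    cells = [cell.text_content().strip() for cell in row.xpath("td|th")]
    print(cells)                     # one list of cell texts per row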
There is a nice book about this topic: Spidering Hacks by Kevin Hemenway and Tara Calishain.
I've found that scripting languages are generally better suited for doing such tasks. I personally prefer Python, but PHP will work as well. Chopping, mincing and parsing strings in Java is just too much work.
I have tried screen-scraping before, but I found it to be very brittle, especially with dynamically-generated code.
I found a third-party DOM-parser and used it to navigate the source code with Regex-like matching patterns in order to find the data I needed.
I suggest trying to find out if the owners of the site have a published API (often Web Services) for retrieving data from their system. If not, then good luck to you.
If what you want is a CSV table, then you can use this approach, using Python.
For example, imagine you want to scrape forex quotes in CSV form from a site like OANDA (oanda.com). Then...
# Python 2 example using BeautifulSoup 3 and urllib.
from BeautifulSoup import BeautifulSoup
import urllib

# Date range and currency pair for the OANDA historical-rates query.
date_s = '&date1=01/01/08'
date_f = '&date=11/10/08'
fx_url = 'http://www.oanda.com/convert/fxhistory?date_fmt=us'
fx_url_end = '&lang=en&margin_fixed=0&format=CSV&redirected=1'
cur1, cur2 = 'USD', 'AUD'
fx_url = fx_url + date_f + date_s + '&exch=' + cur1 + '&exch2=' + cur1
fx_url = fx_url + '&expr=' + cur2 + '&expr2=' + cur2 + fx_url_end

# Fetch the page and pull the CSV data out of its <pre> block.
data = urllib.urlopen(fx_url).read()
soup = BeautifulSoup(data)
data = str(soup.findAll('pre', limit=1))
data = data.replace('[<pre>', '')
data = data.replace('</pre>]', '')

# Write the CSV to disk (edit the location to suit your machine).
file_location = '/Users/location_edit_this/'
file_name = file_location + 'usd_aus.csv'
out_file = open(file_name, "w")
out_file.write(data)
out_file.close()
Once you have it in this form, you can convert the data to any format you like.
At the risk of starting a shitstorm here on SO, I'd suggest that if the format of the table never changes, you could just about get away with using regular expressions to parse and capture the content you need.
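For illustration only, a tiny Python sketch of that regex approach (the markup snippet is made up, and this breaks as soon as the table's format changes):

import re

row = "<tr><td>06:10</td><td>06:25</td><td>06:40</td></tr>"   # made-up table row
cells = re.findall(r"<td[^>]*>(.*?)</td>", row)
print(cells)   # ['06:10', '06:25', '06:40']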
pianohacker overlooked the HTML::TableExtract module, which was designed for exactly this sort of thing. You'd still need LWP to retrieve the table.
This would be by far the easiest with Perl, and the following CPAN modules:
http://metacpan.org/pod/HTML::Parser
http://metacpan.org/pod/LWP
http://metacpan.org/pod/DBD/mysql
http://metacpan.org/pod/DBI.pm
CPAN is the main distribution mechanism for Perl modules; you can install a module by running a shell command such as:
# cpan HTML::Parser
If you're on Windows, things will be more interesting, but you can still do it: http://www.perlmonks.org/?node_id=583586
I have a use case that makes use of the <w:altChunk/> element in a Word document: I inject (fragments of) HTML files as alternate chunks and let Word do its work when the file gets opened. The current implementation uses XML/XSL to compose the WordML XML, modify relationships, and do all the packaging manually, which is a real pain.
I wanted to move to python-docx, but the API doesn't support this directly. I have found a way to add the <w:altChunk/> element to the document XML, but I still struggle to find a way to add the relationship and the related file to the package.
I think I should make a compatible part and pass it to the document.part.relate_to function to do its job, but I still can't figure out how:
from docx import Document
from docx.oxml import OxmlElement, qn
from docx.opc.constants import RELATIONSHIP_TYPE as RT
def add_alt_chunk(doc: Document, chunk_part):
    ''' TODO: figuring how to add files and relationships'''
    r_id = doc.part.relate_to(chunk_part, RT.A_F_CHUNK)
    alt = OxmlElement('w:altChunk')
    alt.set(qn('r:id'), r_id)
    doc.element.body.sectPr.addprevious(alt)
Update:
As per scanny's advice, below is my working code. Thank you very much Steve!
from docx import Document
from docx.oxml import OxmlElement
from docx.oxml.ns import qn
from docx.opc.part import Part
from docx.opc.constants import RELATIONSHIP_TYPE as RT
def add_alt_chunk(doc: Document, html: str):
    package = doc.part.package
    partname = package.next_partname('/word/altChunk%d.html')
    alt_part = Part(partname, 'text/html', html.encode(), package)
    r_id = doc.part.relate_to(alt_part, RT.A_F_CHUNK)
    alt_chunk = OxmlElement('w:altChunk')
    alt_chunk.set(qn('r:id'), r_id)
    doc.element.body.sectPr.addprevious(alt_chunk)

doc = Document()
doc.add_paragraph('Hello')
add_alt_chunk(doc, "<body><strong>I'm an altChunk</strong></body>")
doc.add_paragraph('Have a nice day!')
doc.save('test.docx')
Note: the altChunk parts only work/appear when the document is opened using MS Word.
Well, some hints here anyway. Maybe you can post your working code at the end as a full "answer":
The alt-chunk part needs to start its life as a docx.opc.part.Part object.
The blob argument should be the bytes of the file, which is often but not always plain text. It must be bytes though, not unicode (characters), so any encoding has to happen before calling Part().
I expect you can work out the other arguments:
package is the overall OPC package, available on document.part.package.
You can use docx.opc.package.OpcPackage.next_partname() to get an available partname based on a root template like: "altChunk%s" for a name like "altChunk3". Check what partname prefix Word uses for these, possibly with unzip -l has-an-alt-chunk.docx; should be easy to spot.
The content-type is one in docx.opc.constants.CONTENT_TYPE. Check the [Content_Types].xml part in a .docx file that has an altChunk to see what they use.
Once formed, the document_part.relate_to() method will create the proper relationship. If there is more than one relationship (not common) then you need to create each one separately. There would only be one relationship from a particular part, just some parts are related to more than one other part. Check the relationships in an existing .docx to see, but pretty good guess it's only the one in this case.
So your code would look something like:
package = document.part.package
partname = package.next_partname("altChunkySomethingPrefix")
content_type = docx.opc.constants.CONTENT_TYPE.THE_RIGHT_MIME_TYPE
blob = make_the_altChunk_file_bytes()
alt_chunk_part = Part(partname, content_type, blob, package)
rId = document.part.relate_to(alt_chunk_part, RT.A_F_CHUNK)
etc.
I am new to CouchDB. I need to get 60 or more JSON files per minute from a server.
I have to upload these JSON files to CouchDB individually as soon as I receive them.
I installed CouchDB on my Linux machine.
I hope someone can help me with my requirement.
If possible, could someone help me with pseudo-code?
My idea:
Write a Python script to upload all the JSON files to CouchDB. Each JSON file must become its own document, and the data present in the JSON must be inserted into CouchDB as-is (in the specified format, with the values from the file).
Note:
These JSON files are transactional; one file is generated every second, so I need to read each file and upload it in the same format to CouchDB, and on successful upload archive the file into a different folder on the local system.
Python program to parse the JSON and insert it into CouchDB:
import glob
import errno
import json
from pprint import pprint

import couchdb

couch = couchdb.Server()  # Assuming localhost:5984
#couch.resource.credentials = (USERNAME, PASSWORD)
# If your CouchDB server is running elsewhere, set it up like this:
couch = couchdb.Server('http://localhost:5984/')
db = couch['mydb']

path = 'C:/Users/Desktop/CouchDB_Python/Json_files/*.json'
files = glob.glob(path)

for name in files:  # 'file' is a builtin type, 'name' is a less-ambiguous variable name.
    try:
        with open(name) as f:  # No need to specify 'r': this is the default.
            data = json.load(f)  # Parse the JSON file...
            db.save(data)        # ...and store it as its own CouchDB document.
            pprint(data)
    except IOError as exc:
        if exc.errno != errno.EISDIR:  # Do not fail if a directory is found, just ignore it.
            raise  # Propagate other kinds of IOError.
I would use the CouchDB bulk API, even though you have specified that you need to send them to the db one by one. For example, implementing a simple queue that gets sent out every, say, 5-10 seconds via a bulk doc call will greatly increase the performance of your application.
There is obviously a quirk in that: you need to know the IDs of the docs that you want to get from the DB. But for the PUTs it is perfect. (It is not entirely true; you can get ranges of docs using a bulk operation if the IDs you are using for your docs can be sorted nicely.)
From my experience working with CouchDB, I have a hunch that you are dealing with transactional documents in order to compile them into some sort of sum result and act on that data accordingly (maybe creating the next transactional doc in the series). For that you can rely on CouchDB by using 'reduce' functions on the views you create. It takes a little practice to get a reduce function working properly, and it is highly dependent on what you actually want to achieve and what data you are prepared to emit from the view, so I can't really provide you with more detail on that.
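As a rough illustration (not part of the answer above), a view with a built-in reduce can be created straight from Python over the HTTP API; the design and view names match the app-logic sketch below, while the 'mydb' database and the 'amount' field are made up:

import requests

design_doc = {
    "_id": "_design/someDesign",
    "views": {
        "yourReducedView": {
            "map": "function(doc) { if (doc.amount) { emit(doc._id, doc.amount); } }",
            "reduce": "_sum"   # built-in reduce, runs inside the db engine
        }
    }
}
resp = requests.put("http://localhost:5984/mydb/_design/someDesign", json=design_doc)
resp.raise_for_status()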
So in the end the app logic would go something like this:
get _design/someDesign/_view/yourReducedView
calculate new transaction
add transaction to queue
onTimeout
send all in transaction queue
If I got that first part wrong (about why you are using transactional docs), all that would really change in my app logic is the part where you get those transactional docs.
Also, before writing your own 'reduce' function, have a look at the built-in ones (they are a lot faster than anything outside of the db engine can do).
http://wiki.apache.org/couchdb/HTTP_Bulk_Document_API
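A minimal sketch of the queue-and-flush idea described above in Python (the database name 'mydb', the 5-second interval, and the receive_new_docs() helper are all assumptions, not part of the original answer):

import time
import requests

COUCH_BULK_URL = 'http://localhost:5984/mydb/_bulk_docs'
queue = []

def receive_new_docs():
    """Placeholder: return whatever JSON docs arrived since the last call."""
    return []

while True:
    queue.extend(receive_new_docs())
    if queue:
        resp = requests.post(COUCH_BULK_URL, json={'docs': queue})
        resp.raise_for_status()      # each queued doc becomes its own document
        queue = []
    time.sleep(5)                    # flush every ~5 seconds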
EDIT:
Since you are starting out, I strongly recommend having a look at the CouchDB Definitive Guide.
NOTE FOR LATER:
Here is one hidden stone (well, maybe not so much a hidden stone as something a newcomer wouldn't obviously look out for). When you write a reduce function, make sure it does not produce too much output for a query without boundaries. This will slow down the entire view dramatically, even when you pass reduce=false when getting stuff from it.
So you need to get JSON documents from a server and send them to CouchDB as you receive them. A Python script would work fine. Here is some pseudo-code:
loop (until no more docs)
get new JSON doc from server
send JSON doc to CouchDB
end loop
In Python, you could use requests to send the documents to CouchDB and probably to get the documents from the server as well (if it is using an HTTP API).
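A rough sketch of that loop with requests (the CouchDB URL, the database name 'mydb', and the fetch_next_doc() helper are hypothetical):

import requests

def fetch_next_doc():
    """Placeholder: fetch the next JSON doc from the source server, or None when done."""
    return None   # e.g. requests.get('http://source-server/next').json()

while True:
    doc = fetch_next_doc()
    if doc is None:
        break                        # no more docs
    resp = requests.post('http://localhost:5984/mydb', json=doc)   # CouchDB assigns an _id
    resp.raise_for_status()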
You might want to check out the pycouchdb module for Python 3. I've used it myself to upload lots of JSON objects into a CouchDB instance. My project does pretty much the same as you describe, so you can take a look at my project Pyro on Github for details.
My class looks like this:
import sys
import pycouchdb

class MyCouch:
    """ COMMUNICATES WITH COUCHDB SERVER """

    def __init__(self, server, port, user, password, database):
        # ESTABLISHING CONNECTION
        self.server = pycouchdb.Server("http://" + user + ":" + password + "@" + server + ":" + port + "/")
        self.db = self.server.database(database)

    def check_doc_rev(self, doc_id):
        # CHECKS REVISION OF SUPPLIED DOCUMENT
        try:
            rev = self.db.get(doc_id)
            return rev["_rev"]
        except Exception:
            return -1

    def update(self, all_computers):
        # UPDATES DATABASE WITH JSON STRING
        try:
            result = self.db.save_bulk(all_computers, transaction=False)
            sys.stdout.write(" Updating database")
            sys.stdout.flush()
            return result
        except Exception as ex:
            sys.stdout.write("Updating database")
            sys.stdout.write("Exception: ")
            print(ex)
            sys.stdout.flush()
            return None
Let me know in case of any questions - I will be more than glad to help if you will find some of my code usable.
I want to be able to parse specific content from a website into a MySQL database. For example, from the page http://allrecipes.com/Recipe/Fluffy-Pancakes-2/Detail.aspx I want to parse the content into my database (which has a table with columns RecipeName and Ingredients 1-10).
So basically my database will contain the name and all the ingredients for that recipe. There is no need to edit the content; simply parse it in as-is (e.g. "3/4 cup milk"), since I am using character columns in my database.
How exactly do I go about doing this? I was looking at pre-built parsers, and it seems it's tough to find one that's easy to use, since I am fairly new to programming. Of course, I could manually enter the values, but I want to parse them in.
Would it be possible to just parse this content and write a file with RecipeName, Ingredient pairs, which I could then load into my database? Or should I just do it directly into the database? I am also unsure how to connect a parser directly to a database, but I might be able to find some information on that online.
Basically, I am looking for help on how to exactly go about doing this since I am not very well versed in programming and this seems to be a lot more complicated than it might be.
I am using Java as my main language right now, although I can't say I am very good at it. But I should be able to understand the basic concepts.
Any suggestions on what parser to use or how to do this?
Thanks!
This is how I would do it in PHP. This is almost certainly NOT the most efficient way to do it, nor has it been debugged.
function parseHTML($rawHTML){
    $startPosition = strpos($rawHTML,'<div class="ingredients"'); //Find the position of the beginning of the ingredients list.
    $endPosition = strpos($rawHTML,'</div>',$startPosition); //Find the position of the end of the ingredients list, searching from the beginning of the list (found in step 1).
    $relevantPart = substr($rawHTML,$startPosition,$endPosition - $startPosition); //Isolate the ingredients list (substr takes a length, not an end position).
    $parsedString = strip_tags($relevantPart); //Strip the HTML tags off of the ingredients list.
    return $parsedString;
}
Still to be done: you say you have a MySQL database with 10 separate ingredient columns. This code outputs everything as one big string. You would have to change the strip_tags($relevantPart) call to strip_tags($relevantPart,"<li>"). That would let the <li> tags through. Then you would have to loop through every <li> tag, performing a similar function to this. It shouldn't be too hard, but I don't feel comfortable writing it without a functioning PHP server.
I have a hash with some information (keys and values) in a Perl file. I want to display it in HTML output, and each displayed (key, value) pair will link to something. When I click a link, some related information should be shown.
Can anyone suggest how I can do that? Is this similar to creating a CGI file and using CGI.pm? I will update this question with more detail later.
Yes, you can use the excellent CGI module to render HTML content for you, even if you are not processing CGI forms (i.e. use the module only on output, rather than also for input processing):
use CGI;

# Assumes %hash already holds your key/value pairs.
my $q = CGI->new;
my @html_list = map {
    $q->li($_ . ": " . $hash{$_})
} keys %hash;
print $q->ul({-type => 'foo'}, @html_list);
Depending on the data you're trying to display, something like HTML::Table may be useful if you want to display it in tabular format and don't want the drudgery of assembling the appropriate HTML yourself.
For instance, you could do something like:
my $table = HTML::Table->new(-columns => 2);
for my $key (sort keys %hash) {
    $table->addRow($key, $hash{$key});
}
$table->print;
Also, there is a free Beginning Perl book available online, which has a chapter devoted to CGI scripts, along with a lot of other useful information.
If this is more than just a simple one-off script, you might also wish to consider using one of the many Perl web frameworks like Dancer, Catalyst, Mojo etc.
I'm trying to write a parser to get the data out of a typical html table day/time schedule (like this).
I'd like to give this parser a page and a table class/id, and have it return a list of events, along with days & times they occur. It should take into account rowspans and colspans, so for the linked example, it would return
{:event => "Music With Paul Ray", :times => ["T 12:00am - 3:00am", "F 12:00am - 3:00am"]}, etc.
I've sort of figured out a messy, half-finished approach using Ruby, and am wondering how you might tackle such a problem.
The best thing to do here is to use an HTML parser. With an HTML parser you can look at the table rows programmatically, without having to resort to fragile regular expressions and doing the parsing yourself.
Then you can run some logic along the lines of (this is not runnable code, just a sketch that you should be able to see the idea from):
for row in table:
    i = 0
    for cell in row:  # skipping row 1
        event = name
        starttime = row[0]
        endtime = table[ i + cell.rowspan + 1 ][0]
        print event, starttime, endtime
        i += 1
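For a concrete (if simplified) version of that sketch, here is how it might look in Python with BeautifulSoup; the miniature table below is made up, and real schedule pages will need more care with colspans and malformed markup:

from bs4 import BeautifulSoup

html = """
<table id="schedule">
  <tr><th>Time</th><th>Monday</th></tr>
  <tr><td>12:00am</td><td rowspan="3">Music With Paul Ray</td></tr>
  <tr><td>1:00am</td></tr>
  <tr><td>2:00am</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find("table", id="schedule").find_all("tr")[1:]   # skip the header row

for i, row in enumerate(rows):
    cells = row.find_all("td")
    start_time = cells[0].get_text(strip=True)
    for cell in cells[1:]:
        span = int(cell.get("rowspan", 1))
        last_row = rows[min(i + span - 1, len(rows) - 1)]      # last row the cell spans
        end_time = last_row.find("td").get_text(strip=True)
        print(cell.get_text(strip=True), start_time, end_time)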
This is what the program will need to do:
Read the tags in (detect attributes and open/close tags)
Build an internal representation of the table (how will you handle malformed tables?)
Calculate the day, start time, and end time of each event
Merge repeated events into an event series
That's a lot of components! You'll probably need to ask a more specific question.
Use http://www.crummy.com/software/BeautifulSoup/ and that task should be a breeze.
As said, using regexes on HTML is generally a bad idea, you should use a good parser.
For validating XHTML pages, you can use a simple XML parser which is available in most languages. Alas, in your case, the given page doesn't validate (W3C's markup validation service reports 230 Errors, 7 warning(s)!)
For generic, possibly malformed HTML, there are libraries to handle that (kigurai recommends BeautifulSoup for Python; I also know of TagSoup for Java; there are others).