tl;dr: I'm looking for a way to find entries in our database which are missing information, get that information from a website, and add it to the database entries.
We have a media management program which uses a MySQL table to store the information. When employees download media (video files, images, audio files) and import it into the media manager, they are supposed to also copy the description of the media (from the source website) and add it to the description in the Media Manager. However, this has not been done for thousands of files.
The file name (e.g. file123.mov) is unique, and the details page for that file can be accessed by going to a URL on the source website:
website.com/content/file123
The information we want to scrape from that page has an element ID which is always the same.
In my mind the process would be:
Connect to database and Load table
Filter: "format" is "Still Image (JPEG)"
Filter: "description" is "NULL"
Get first result
Get "FILENAME" without extension)
Load the URL: website.com/content/FILENAME
Copy contents of the element "description" (on website)
Paste contents into the "description" (SQL entry)
Get 2nd result
Rinse and repeat until the last result is reached
My question(s) are:
Is there software that could perform such a task or is this something that would need to be scripted?
If scripted, what would be the best type of script (e.g. could I achieve this using AppleScript, or would it need to be made in Java or PHP, etc.)?
Is there software that could perform such a task or is this something that would need to be scripted?
I'm not aware of anything that will do what you want out of the box (and even if there was, the configuration required won't be much less work than the scripting involved in rolling your own solution).
If scripted, what would be the best type of script (e.g. could I achieve this using AppleScript, or would it need to be made in Java or PHP, etc.)?
AppleScript can't connect to databases, so you will definitely need to throw something else into the mix. If the choice is between Java and PHP (and you're equally familiar with both), I'd recommend PHP for this purpose, as there will be considerably less code involved.
Your PHP script would look something like this:
$BASEURL = 'http://website.com/content/';

// connect to the database
$dbh = new PDO($DSN, $USERNAME, $PASSWORD);

// query for files without descriptions
$qry = $dbh->query("
    SELECT FILENAME FROM mytable
    WHERE format = 'Still Image (JPEG)' AND description IS NULL
");

// prepare an update statement
$update = $dbh->prepare('
    UPDATE mytable SET description = :d WHERE FILENAME = :f
');
$update->bindParam(':d', $DESCRIPTION);
$update->bindParam(':f', $FILENAME);

// suppress libxml warnings from real-world (malformed) HTML
libxml_use_internal_errors(true);

// loop over the files
while ($FILENAME = $qry->fetchColumn()) {
    // construct the URL by stripping the extension, if any
    $i = strrpos($FILENAME, '.');
    $url = $BASEURL . (($i === false) ? $FILENAME : substr($FILENAME, 0, $i));

    // fetch the document
    $doc = new DOMDocument();
    $doc->loadHTMLFile($url);

    // get the description (note: DOMDocument's method is getElementById, singular)
    $DESCRIPTION = $doc->getElementById('description')->nodeValue;

    // update the database
    $update->execute();
}
I too am not aware of any existing software packages that will do everything you're looking for. However, Python can connect to your database, make web requests easily, and handle dirty HTML. Assuming you already have Python installed, you'll need three packages:
MySQLdb for connecting to the database.
Requests for easily making http web requests.
BeautifulSoup for robust parsing of html.
You can install these packages with pip commands or Windows installers. Appropriate instructions are on each site. The whole process won't take more than 10 minutes.
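If you want to confirm the installs worked, here is a quick sanity check from a Python prompt (just a sketch; it imports each package and prints its version):
# If these imports succeed, all three packages are installed.
import MySQLdb
import requests
import bs4
print MySQLdb.__version__, requests.__version__, bs4.__version__
The full script then looks like this: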
import MySQLdb as db
import os.path
import requests
from bs4 import BeautifulSoup

# Connect to the database. Fill in these fields as necessary.
con = db.connect(host='hostname', user='username', passwd='password',
                 db='dbname')

# Create and execute our SELECT sql statement. Note that MySQLdb
# uses %s placeholders, and matching NULL requires "IS NULL".
select = con.cursor()
select.execute("SELECT filename FROM table_name "
               "WHERE format = %s AND description IS NULL",
               ('Still Image (JPEG)',))

# A second cursor on the same connection, for the updates.
update = con.cursor()

while True:
    # Fetch a row from the result of the SELECT statement.
    row = select.fetchone()
    if row is None:
        break

    # Use Python's built-in os.path.splitext to split the extension
    # and get the url_name.
    filename = row[0]
    url_name = os.path.splitext(filename)[0]
    url = 'http://www.website.com/content/' + url_name

    # Make the web request. You may want to rate-limit your requests
    # so that the website doesn't get angry. You can slow down the
    # rate by inserting a pause with:
    #
    #     import time   # You can put this at the top with other imports
    #     time.sleep(1) # This will wait 1 second.
    response = requests.get(url)
    if response.status_code != 200:
        # Don't worry about skipped urls. Just re-run this script
        # on spurious or network-related errors.
        print 'Error accessing:', url, 'SKIPPING'
        continue

    # Parse the result. BeautifulSoup does a great job handling
    # mal-formed input.
    soup = BeautifulSoup(response.content, 'html.parser')
    description = soup.find('div', {'id': 'description'}).get_text()

    # And finally, update the database with another query.
    update.execute("UPDATE table_name SET description = %s "
                   "WHERE filename = %s",
                   (description, filename))

# MySQLdb does not autocommit by default, so persist the updates.
con.commit()
Fair warning: I've made a good effort to make that code look right, but I haven't actually tested it. You'll need to fill in the private details.
PHP is good for scraping. I have made a class that wraps PHP's cURL extension here:
http://semlabs.co.uk/journal/object-oriented-curl-class-with-multi-threading
You'll probably need to use some of the options:
http://www.php.net/manual/en/function.curl-setopt.php
To scrape HTML, I usually use regular expressions, but here is a class I made that should be able to query HTML without issues:
http://pastebin.com/Jm9jKjAU
Usage is:
$h = new HTMLQuery();
$h->load( $string_containing_html );
$h->getElements( 'p', 'id' ); // Returns all p tags with an id attribute
The best option for scraping would be XPath, but it can't handle dirty HTML. You can use it to do things like:
//div[@class = 'itm']/p[last() and text() = 'Hello World'] <- selects the last p in div elements that have the innerHTML 'Hello World'
You can use that in PHP with the DOM class (built-in).
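For illustration, here is a minimal sketch of that kind of XPath query, written in Python with lxml rather than PHP's DOM class (the HTML string is made up for the example):
from lxml import html
# A made-up document standing in for a fetched page.
page = "<div class='itm'><p>Hi</p><p>Hello World</p></div>"
tree = html.fromstring(page)
# Select <p> children of <div class="itm"> whose text is 'Hello World'.
nodes = tree.xpath("//div[@class='itm']/p[text()='Hello World']")
print(nodes[0].text)  # -> Hello World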
Related
I'm trying to get a result from a MySQL database and display selected values in a .tpl template file.
This is what I tried so far:
{php}
$clienthosting = $this->get_template_vars(service);//Here is where the exception is thrown
$dbid = $clienthosting['id'];
$query = mysql_query("SELECT dedicatedip FROM tblhosting WHERE id = $dbid");
$result = mysql_fetch_array($query);
$dedicatedip = $result["dedicatedip"];
$this->assign("dedicatedip", $dedicatedip);
{/php}
But it generated the following error:
Something went wrong and we couldn't process your request.
What am I doing wrong?
Thanks.
WHMCS recommends not using {php} inside .tpl files; you can use hooks instead to add variables and then use them in the TPL file.
But you can enable it in the settings: Setup > General Settings > Security > Allow Smarty PHP Tags.
Also, if you're running PHP 7, the mysql extension has been removed, so you can't use functions like mysql_query and mysql_fetch_array.
You should use Capsule and Eloquent instead, as recommended on the Interacting with the Database page.
I've got 200k CSV files and I need to import them all into a single PostgreSQL table. It's a list of parameters from various devices, and each CSV's file name contains the device's serial number, which I need in one of the columns for each row.
So to simplify: I've got a few columns of data (no headers). Let's say the columns in each CSV file are Date, Variable, Value, and the file name is SERIALNUMBER_and_someOtherStuffIDontNeed.csv.
I'm trying to use Cygwin to write a bash script to iterate over the files and do it for me, however for some reason it won't work, showing 'syntax error at or near "as"'.
Here's my code:
#!/bin/bash
FILELIST=/cygdrive/c/devices/files/*
for INPUT_FILE in $FILELIST
do
psql -U postgres -d devices -c "copy devicelist
(
Date,
Variable,
Value,
SN as CURRENT_LOAD_SOURCE(),
)
from '$INPUT_FILE'
delimiter ',' ;"
done
I'm learning SQL so it might be an obvious mistake, but I can't see it.
Also, I know that in that form I will get the full file name, not just the serial number bit I want, but I can probably handle that somehow later.
Please advise.
Thanks.
I don't think there is a CURRENT_LOAD_SOURCE() function in Postgres. A work-around is to leave the name column NULL on copy, and patch it to the desired value just after the copy. I prefer a shell here-document because that makes quoting inside the SQL body easier. (BTW: for 10K files, the globbing needed to obtain FILELIST might exceed the shell's ARG_MAX ...)
#!/bin/bash
FILELIST="`ls /tmp/*.c`"
for INPUT_FILE in $FILELIST
do
    echo "File:" $INPUT_FILE
    psql -U postgres -d devices <<OMG
    -- I have a schema "tmp" for testing purposes
    CREATE TABLE IF NOT EXISTS tmp.filelist(name text, content text);
    COPY tmp.filelist ( content )
    from '$INPUT_FILE' delimiter ',' ;
    -- patch the name column for the rows we just loaded
    UPDATE tmp.filelist SET name = '$INPUT_FILE'
    WHERE name IS NULL;
OMG
done
For anyone interested in an answer: I used a Python script to change the file names, and then another script using psycopg2 to connect to the database and do everything in one connection. It took 10 minutes instead of 10 hours.
Here's the code:
Renaming the files (apparently, to import from CSV you need all the rows to be filled, and the information I needed was in the first 4 columns anyway, so I put together a solution that generates whole new CSVs instead of just renaming them):
import os
import csv

path = 'C:/devices/files'
os.chdir(path)

i = 0
for file in os.listdir(path):
    try:
        i += 1
        if i % 10000 == 0:
            # just to see the progress
            print(i)
        serial_number = file[:8]
        # newline='' stops the csv module from writing blank rows on Windows
        with open(file) as fin, open('processed_' + file, 'w', newline='') as fout:
            creader = csv.reader(fin)
            cwriter = csv.writer(fout)
            for cline in creader:
                # drop the columns I don't need and prepend the serial number
                new_line = [val for col, val in enumerate(cline) if col not in (4, 5, 6, 7)]
                new_line.insert(0, serial_number)
                cwriter.writerow(new_line)
    except Exception:
        print('problem with file: ' + file)
Updating database:
import os
import psycopg2

path = "C:\\devices\\files"
directory_listing = os.listdir(path)

conn = psycopg2.connect("dbname='devices' user='postgres' host='localhost'")
cursor = conn.cursor()
print(len(directory_listing))

i = 100001
while i < 218792:
    current_file = directory_listing[i]
    i += 1
    full_path = "C:/devices/files/" + current_file
    with open(full_path) as f:
        cursor.copy_from(file=f, table='devicelistlive', sep=",")

conn.commit()
conn.close()
Don't mind the while loop and the weird numbers; that's just because I was doing it in portions for testing purposes. It can easily be replaced with a for loop, as sketched below.
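For completeness, here is a minimal sketch of that for-loop version (same names as the script above; it still assumes the files have already been preprocessed):
# Same connection and cursor as above; iterate over every file
# instead of a hand-picked index range.
for current_file in directory_listing:
    full_path = os.path.join(path, current_file)
    with open(full_path) as f:
        cursor.copy_from(file=f, table='devicelistlive', sep=',')
conn.commit()
conn.close()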
I have coded a Ruby IRC bot, which is on GitHub (/ninjex/rubot), and which is producing conflicting output with MySQL on a dedicated server I just purchased.
Firstly, we have the connection to the database in the MySQL folder (in .gitignore), which looks similar to the following code block:
@con = Mysql.new('localhost', 'root', 'pword', 'db_name')
Then we have an actual function to query the database
def db_query
  que = get_message # Grabs the query from the user, e.g. ./db_query SELECT * FROM words
  results = @con.query(que) # Sends the query through the connection, e.g. @con.query("SELECT * FROM WORDS")
  results.each { |x| chan_send(x) } # For each row returned, send it to the channel
end
On my local machine, when running the command:
./db_query SELECT amount, user from words WHERE user = 'Bob' and word = 'hello'
I receive the output in IRC in an array-like fashion: ["17", "Bob"], where 17 is the amount and Bob is the user.
However, using this same function on my dedicated server results in output like 17Bob. I have attempted many changes in the code, and tried to parse the data into its own variable, but it seems that 17Bob comes out as a single value, making it impossible to parse into something like an array, which I could then use to send the data correctly.
This seems odd to me on both my local machine and the dedicated server, as I was expecting the output to first send 17 to the IRC and then Bob, like:
17
Bob
For all the functions and source you can check my GitHub (/Ninjex/rubot); however, you may need to install some gems.
A few notes:
Make sure you are sanitizing the query coming from get_message, or you are opening yourself up to some serious security problems (SQL injection).
Ensure you are using the same versions of the mysql gem, Ruby, and MySQL on both machines. Differences in any of these may alter the expected output.
If you are at your wits' end and unable to resolve the underlying issue, you can always send a custom delimiter and use it to split. Unfortunately, that will muck up the case that is actually working, and will need to be stripped out there.
Here's how I would approach debugging the issue on the dedicated machine:
def db_query
  que = get_sanitized_message
  results = @con.query(que)
  require 'pry'
  binding.pry
  results.each { |x| chan_send(x) }
end
Add the pry gem to your Gemfile, or gem install pry.
Update your code to use pry: see above
This will open up a pry console when the binding.pry line is hit and you can interrogate almost everything in your running application.
I would take a look at results and see if it's an array. Just type results in the console and it will print out the value. Also type results.class. It's possible that query is returning some special result-set object that is not an array, but that has a method to access the result array.
If results is an array, then the issue is most likely in chan_send. Perhaps it needs to use something like puts vs print to ensure there's a newline after each message. Is it possible that you have different versions of your codebase deployed? I would also add a sleep 1 within the each block to ensure that this is not related to your handling of messages arriving at the same time.
I want to store huge content in the DB. My sample text is 16,129 characters in length; when I try to execute this query, it shows "error: The requested URL could not be retrieved" in Firefox and "no data received" in Chrome.
Moreover, I use LONGTEXT as the datatype for the text column in the DB.
I also tried executing the query directly in phpMyAdmin, and it works correctly.
The code is shown below.
public function _getConnection($type = 'core_write') {
    return Mage::getSingleton('core/resource')->getConnection($type);
}

public function testdbAction() {
    $db = $this->_getConnection();
    $current_time = now();
    $text = "The European languages are members of the same family...... ...Europe uses the same vocabulary. The "; // text is 16129 characters in length
    $sql = "INSERT into test(`usercontent_id`,`app_id`,`module_id`,`customer_id`,`content`,`created_time`,`updated_time`,`item_id`,`index_id`,`position_id`) VALUES (NULL, 15, 9,2,'" . $text . "','" . $current_time . "','" . $current_time . "',1003,5,4)";
    $db->query($sql);
}
How do I handle this? Any suggestions or help appreciated.
Try using $db->exec($sql) instead of $db->query($sql).
Magento has a dedicated place for manipulating database structure and data: install / upgrade scripts. These were made to keep track of module versions for easy updates, so you should use a setup script to add new data.
Here's an example.
config.xml of your module:
<modules>
    <My_Module>
        <version>0.0.1</version>
    </My_Module>
</modules>
<global>
    <resources>
        <my_module_setup>
            <setup>
                <module>My_Module</module>
                <class>Mage_Core_Model_Resource_Setup</class>
            </setup>
        </my_module_setup>
    </resources>
</global>
Now, you need to create the following file:
My_Module/sql/my_module_setup/install-0.0.1.php
Now, depending on your Magento version, the file will need a different name. If you're using Magento CE 1.4 or lower (EE 1.8), you should name it mysql4-install-0.0.1.php.
This script will be launched on the next website request. The class Mage_Core_Model_Resource_Setup will execute the code inside install-0.0.1.php. From within the install script, you will have access to the setup class via the $this object.
So now, you can write the following code in the script:
$this->startSetup();

$text = <<<TEXT
YOUR LONG TEXT GOES HERE
TEXT;

$this->getConnection()
    ->insert(
        $this->getTable('test'),
        array(
            'usercontent_id' => null,
            'app_id'         => 15,
            'content'        => $text,
            // other fields in the same fashion
        )
    );

$this->endSetup();
And that's it. It's a clean and appropriate way of adding custom data to the database in Magento.
If you want to save user input on a regular basis using forms, then I recommend creating a model, resource model, and collection, and defining the entities in config.xml. For more information, please refer to Alan Storm's articles, like this one.
I hope that I understood your question correctly.
I stumbled upon the following:
def save_formset(self, request, form, formset, change):
    instances = formset.save(commit=False)
    bargain_id = 0
    total_price = Decimal(0)
    for instance in instances:
        if isinstance(instance, BargainProduct):
            total_price += instance.quantity * instance.product.price
            bargain_id = instance.id
        instance.save()
    updateTotal = Bargain.objects.get(id=bargain_id)
    updateTotal.total_price = total_price - updateTotal.discount_price
    updateTotal.save()
This code works for me on my local MySQL setup; however, on my live test environment running on SQLite3* I get the "Bargain matching query does not exist." error.
I figure this is due to a different order of saving the instances on SQLite, but it seems they should run (and act) the same..?
*I cannot recompile MySQL with Python support on my live server at the moment, so that's a no-go.
Looking at the code: if no instances come out of formset.save(), bargain_id will still be 0 when execution reaches the Bargain.objects.get(id=bargain_id) line, since the for loop is skipped entirely. If it is 0, I'm guessing the get() will fail with the error you are seeing.
You might want to check whether the values are getting stored correctly in the database during formset.save(), and whether it actually returns anything back to instances.
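A minimal sketch of a guard for that case (using the names from the question's code; purely illustrative):
instances = formset.save(commit=False)
if not instances:
    # Nothing came back from the formset, so bargain_id would stay 0
    # and Bargain.objects.get(id=0) would raise DoesNotExist.
    return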
This line is giving the error:
updateTotal = Bargain.objects.get(id=bargain_id)
which most probably is because of this line:
instances = formset.save(commit=False)
Did you define a save() method for the formset? It doesn't seem to have one built in; you save it by accessing what formset.cleaned_data returns, as the Django docs say.
Edit: I correct myself; it actually does have a save() method, based on this page.
I've been looking at this same issue. It is saving the data to the database, and the formset is filled. The problem is that the save on instances = formset.save(commit=False) doesn't return a value, even though, when I look at the built-in save method, it should give back the saved data.
Another weird thing about this is that it seems to work on my friend's MySQL backend, but not on his SQLite3 backend. On top of that, it doesn't work on my MySQL backend either.
The local loop returns these printouts (on MySQL); on SQLite3 it fails with a "does not exist" error on the query:
('Formset: ', <django.forms.formsets.BargainProductFormFormSet object at 0x101fe3790>)
('Instances: ', [<BargainProduct: BargainProduct object>])
[18/Apr/2011 14:46:20] "POST /admin/shop/deal/add/ HTTP/1.1" 302 0