How to get data from big table row by row - mysql

I need to get all the data from a MySQL table. What I've tried so far is:
my $query = $connection->prepare("select * from table");
$query->execute();
while (my @row = $query->fetchrow_array)
{
    print format_row(@row);
}
but there is always a but...
The table has about 600M rows, and apparently all results from the query are stored in memory after the execute() call. There is not enough memory for this :(
My question is:
Is there a way to use Perl DBI to get data from the table row by row? Something like this:
my $query = $connection->prepare("select * from table");
while (my @row = $query->fetchrow_array)
{
    # ....do stuff
}
btw, pagination is too slow :/

apparently all results from the query are stored in memory after the execute() call
That is the default behaviour of the mysql client library. You can disable it by using the mysql_use_result attribute on the database or statement handle.
Note that the read lock you'll have on the table will be held much longer while all the rows are streamed to the client code. If that might be a concern, you may want to use SQL_BUFFER_RESULT.
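For example, with DBD::mysql the attribute can be passed when preparing the statement. A minimal sketch based on the question's code (assuming DBD::mysql is the driver in use):

my $query = $connection->prepare("select * from table",
                                 { mysql_use_result => 1 });
$query->execute();
while (my @row = $query->fetchrow_array)
{
    # rows are now streamed from the server instead of buffered in client memory
    print format_row(@row);
}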

The fetchall_arrayref method takes two parameters, the second of which allows you to limit the number of rows fetched from the table at once.
The following code reads 1,000 rows from the table at a time and processes each one:
my $sth = $dbh->prepare("SELECT * FROM table");
$sth->execute;
while ( my $chunk = $sth->fetchall_arrayref( undef, 1000 ) ) {
    last unless @$chunk;    # Empty array returned at end of table
    for my $row ( @$chunk ) {
        print format_row(@$row);
    }
}

When working with huge tables I build data packages with dynamically built SQL statements like
$sql = "SELECT * FROM table WHERE id > " . $lastid . " ORDER BY id LIMIT " . $packagesize;
The application dynamically fills in $lastid according to the last package it processed.
If the table has an id field, it should also have an index built on that field, so performance is quite good.
It also limits database load by allowing little rests between queries.
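A minimal Perl/DBI sketch of that pattern (the table name, the id column, and $dbh are placeholders taken from the question):

my $packagesize = 10_000;
my $lastid      = 0;

while (1) {
    my $sth = $dbh->prepare(
        "SELECT * FROM table WHERE id > ? ORDER BY id LIMIT ?");
    $sth->execute($lastid, $packagesize);

    my $rows = $sth->fetchall_arrayref({});   # one package as an array of hashrefs
    last unless @$rows;                       # no more packages left

    for my $row (@$rows) {
        # ... do stuff with $row ...
        $lastid = $row->{id};                 # remember where this package ended
    }
    sleep 1;                                  # a little rest between queries
}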

Related

How should I select the next item to process when running processors in parallel?

I'm asking this question without database specifics because it feels like the answer may lie in a common design pattern, and I don't necessarily need a system-specific solution (my specific system setup is referenced at the end of the question).
I've got a database of companies containing an id, a url, and a processing field, to indicate whether or not that company is currently being processed by one of my crawlers. I run many crawlers in parallel. Each one needs to select a company to process and set that company as processing before it starts so that each company is only being processed by a single crawler at any given time.
How should I structure my system to keep track of what companies are being processed?
The challenge here is that I cannot search my database for a company that is not being processed and then update that company to set it as processed, because another crawler may have chosen it in the meantime. This seems like something that must be a common problem when processing data in parallel so I'm looking for a theoretical best practice.
I used to use MySQL for this and used the following code to maintain consistency between my processors. I'm redesigning the system, however, and now ElasticSearch is going to be my main database and search server. The MySQL solution below always felt like a hack to me and not a proper solution to this parallelization problem.
public function select_next()
{
    // set a temp variable that allows us to retrieve the id of the row that is updated during the next query
    $sql = 'SET @update_id := 0';
    $Result = $this->Mysqli->query( $sql );
    if( ! $Result )
        die( "\n\n " . $this->Mysqli->error . "\n" . $sql );

    // selects the next company to be crawled, marks it as crawling in the db
    $sql = "UPDATE companies
            SET
                crawling = 1,
                id = ( SELECT @update_id := id )
            WHERE crawling = 0
            ORDER BY last_crawled ASC, id ASC
            LIMIT 1";
    $Result = $this->Mysqli->query( $sql );
    if( ! $Result )
        die( "\n\n " . $this->Mysqli->error . "\n" . $sql );

    // this query returned at least one result and there are companies to be crawled
    if( $this->Mysqli->affected_rows > 0 )
    {
        // gets the id of the row that was just updated in the previous query
        $sql = 'SELECT @update_id AS id';
        $Result = $this->Mysqli->query( $sql );
        if( ! $Result )
            die( "\n\n " . $this->Mysqli->error . "\n" . $sql );

        // set company id
        $this->id = $Result->fetch_object()->id;
    }
}
One approach that is often used for such problems is sharding. You can define a deterministic function that assigns each row in the database to a crawler. In your case, such a function can simply be the company id modulo the number of crawlers. Each crawler can sequentially process the companies that belong to its shard, which guarantees no company is ever processed by two crawlers simultaneously.
Such an approach is used, for example, by the Reduce part of MapReduce.
An advantage is that no transactions or locking are required, which are hard to implement and are often a bottleneck, especially in a distributed environment. A disadvantage is that the work may not be partitioned equally between crawlers, in which case some crawlers sit idle while others are still processing.
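As a rough illustration of the modulo idea (sketched in Perl/DBI to match the earlier examples; the companies table and its columns come from the question, and the crawler numbering is an assumption):

# this instance is crawler 0 of 4; each crawler gets its own disjoint shard
my ($crawler_index, $num_crawlers) = (0, 4);

my $sth = $dbh->prepare(
    "SELECT id, url FROM companies WHERE MOD(id, ?) = ? ORDER BY last_crawled ASC");
$sth->execute($num_crawlers, $crawler_index);

while (my $company = $sth->fetchrow_hashref) {
    # no other crawler will ever select these ids, so no locking is needed
    # ... crawl $company->{url} ...
}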

Using fetchrow_hashref to store data

I am trying to take information out of a MySQL database, which I will then manipulate in Perl:
use strict;
use DBI;
my $dbh_m= DBI->connect("dbi:mysql:Populationdb","root","LisaUni")
or die("Error: $DBI::errstr");
my $Genotype = 'Genotype'.1;
#The idea here is eventually I will ask the database how many Genotypes there are, and then loop it round to complete the following for each Genotype:
my $sql =qq(SELECT TransNo, gene.Gene FROM gene JOIN genotypegene ON gene.Gene = genotypegene.Gene WHERE Genotype like '$Genotype');
my $sth = $dbh_m->prepare($sql);
$sth->execute;
my %hash;
my $transvalues = $sth->fetchrow_hashref;
my %hash = %$transvalues;
$sth->finish();
$dbh_m->disconnect();
my $key;
my $value;
while (($key, $value) = each(%hash)) {
    print $key.", ".$value."\n";
}
This code doesn't produce any errors, but the %hash only stores the last row taken from the database (I got the idea of writing it this way from this website). If I type:
while (my $transvalues = $sth->fetchrow_hashref) {
    print "Gene: $transvalues->{Gene}\n";
    print "Trans: $transvalues->{TransNo}\n";
}
Then it does print off all the rows, but I need all this information to be available once I've closed the connection to the database.
I also have a related question: in my MySQL database a row consists of e.g. 'Gene1' (Gene) and '4' (TransNo). Once I have taken this data out of the database as I am doing above, will the TransNo still know which Gene it is associated with? Or do I need to create some kind of hash-of-hashes structure for that?
You are calling the "wrong" function
fetchrow_hashref will return one row as a hashref; you should wrap its use inside a loop, ending it when fetchrow_hashref returns undef.
It seems like you are looking for fetchall_hashref, which will give you all of the returned rows as a hash, with the first parameter specifying which field to use as the key.
$hash_ref = $sth->fetchall_hashref ($key_field);
Each row will be inserted into $hash_ref as an internal hashref, using $key_field as the key under which you can find the row in $hash_ref.
What does the documentation say?
The fetchall_hashref method can be used to fetch all the data to be returned from a prepared and executed statement handle.
It returns a reference to a hash containing a key for each distinct value of the $key_field column that was fetched.
For each key the corresponding value is a reference to a hash containing all the selected columns and their values, as returned by fetchrow_hashref().
Documentation links
DBI - search.cpan.org #fetchrow_hashref
DBI - search.cpan.org #fetchall_hashref
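Applied to the code in the question, a minimal sketch (assuming the Gene values are unique for the genotype, so they can serve as the key):

my $genes = $sth->fetchall_hashref('Gene');   # all rows, keyed by Gene
$sth->finish();
$dbh_m->disconnect();

# the data is still available after the connection is closed,
# and each TransNo stays attached to its Gene
for my $gene (keys %$genes) {
    print "$gene, $genes->{$gene}{TransNo}\n";
}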

MySQL: The easiest way to display all data from a table by getting only the TABLE_NAME (no column information) on the HTML page

A MySQL db named "db2011" has several tables.
Using the PHP variable $tablename (the table name is correct, the table exists in the db), how can I display the data from "db2011".$tablename on the HTML page?
Can it be done by performing just one query - to select all the data by $tablename?
I think it can be done in two steps, but I'm asking for a better solution (in case this one is not good):
Step 1: Get the column names by performing
"SELECT COLUMN_NAME FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_SCHEMA='db2011' AND TABLE_NAME='".$tablename."'";
Step 2: Build and perform a query of the form SELECT <comma-separated list of columns> FROM $tablename?
P.S. I know the data could be huge to display on the HTML page. I will limit it to 100 rows.
Any better ways?
Thank you.
I am assuming you are doing this in PHP. It may not be elegant, but it gets everything in one query. I think you want to display the table columns as well as the data in one query.
<?php
$sql = "SELECT * FROM $tablename";
$res = mysql_query($sql);
$firstpass = TRUE;
while($row = mysql_fetch_assoc($res)){
    if($firstpass){
        foreach($row as $key => $value) {
            //store all the column names here ($key)
        }
        $firstpass = FALSE;
    }
    //Collect all the column information down here.
}
?>
Why not just SELECT * FROM $tablename LIMIT 100;?
You would get all the column names back in the result set. Unless you also need the column type on your webpage, I would go with just that.

Pulling all posts in vBulletin

I want to create a script that counts the words in threads in a vBulletin forum. Basically, I want to pull that data from the MySQL database and play with it. I don't have experience working with vBulletin, so I'm thinking of two ways:
Does vBulletin provide an API to handle database stuff (allowing me to grab all the thread content and URLs)? I'm almost sure there is; a link on where to start would help.
Is there a solution that does this without involving vBulletin? This means grabbing the data manually from the MySQL database and doing things the typical way.
I'd prefer the second method if the vBulletin learning curve is too steep. Thanks for the advice.
Is this for vBulletin 3 or 4?
I mostly work with vB3, and the quickest way to include all of the vBulletin resources is to create a new PHP file in your forums directory with the following code.
<?php
error_reporting(E_ALL & ~E_NOTICE & ~8192);
require_once('./global.php');
var_dump($vbulletin);
That $vbulletin variable is the registry object that contains just about everything you're ever going to need, including the database connection and its read and write functions, userinfo data, cleaned _POST and _REQUEST values, and a lot more (phrases, session data, caches, etc).
There are 4 database functions you'll use the most.
$vbulletin->db->query_read() // fetch more than one row
$vbulletin->db->fetch_array() // converts the query_read returned data into an array
$vbulletin->db->query_first() // fetches a single row
$vbulletin->db->query_write() // update, insert or delete rows, tables, etc.
query_read is what you would use when you expect more than one result that you intend to loop through. For example, if you wanted to count all the words in a single thread, you would need to query the post table with the threadid, loop through each post in that thread, and count all the words in the message.
Note: "TABLE_PREFIX" is a constant set in config.php. It's important to always prepend the table name with that constant in case other forums decide to prefix their tables.
<?php
error_reporting(E_ALL & ~E_NOTICE & ~8192);
require_once('./global.php');
$threadid = 1;
// fetch all posts from a specific thread
$posts = $vbulletin->db->query_read("
SELECT pagetext
FROM " . TABLE_PREFIX . "post
WHERE threadid = $threadid
");
/**
* Loop through each post.
*
* Here we use the "fetch_array" method to convert the MySQL data into
* a useable array. 99% of the time you'll use "fetch_array" when you
* use "query_read".
*
* $post will contain the post data for each loop. In our case, we only
* have "pagetext" available to use since that's all we told MySQL we needed
* in our query. You can do SELECT * if you want ALL the table data.
*/
$totalWords = 0;
while ($post = $vbulletin->db->fetch_array($posts)) {
    $totalWords = $totalWords + str_word_count($post['pagetext']);
}
/**
* Print the total number of words this thread contains.
*
* The "vb_number_format" is basically wrapper of php's "number_format", but
* with some vBulletin extras. You can visit the /includes/functions.php file
* for all the functions available to you. Many of them are just convenient
* functions so you don't have to repeat a lot of code. Like vBDate(), or
* is_valid_email().
*/
echo sprintf("Thread ID %i contains %s words", $threadid, vb_number_format($totalWords));
The query_first function is what you would use when you need to fetch a single row from the database. No looping required or anything like that. If, for instance, you wanted to fetch a single user's information from the database - which you don't need a query for, but we'll do it as an example - you would use something like this.
<?php
error_reporting(E_ALL & ~E_NOTICE & ~8192);
require_once('./global.php');
$userid = 1;
$user = $vbulletin->db->query_first("
SELECT *
FROM " . TABLE_PREFIX . "user
WHERE userid = $userid
");
echo sprintf("Hello, %s. Your email address is %s and you have %s posts",
$user['username'],
$user['email'],
vb_number_format($user['posts'])
);
Lastly, if you want to update something in the database, you would use "query_write". This function is pretty straightforward: it takes any MySQL UPDATE, INSERT, or DELETE query. For example, if you needed to update a user's Yahoo ID, you would do:
<?php
error_reporting(E_ALL & ~E_NOTICE & ~8192);
require_once('./global.php');
$userid = 1;
$update = $vbulletin->db->query_write("
UPDATE " . TABLE_PREFIX . "user
SET yahoo = 'myYahooID@yahoo.com'
WHERE userid = $userid
");
if ($update) {
$userinfo = fetch_userinfo($userid);
echo sprintf("Updated %s yahoo ID to %s", $userinfo['username'], $userinfo['yahoo']);
}
Hopefully this will help you get started. I would recommend using vBulletin 3 for a little while until you're comfortable with it. I think it'll be easier on you. Most of this will translate to vBulletin 4 with some tweaking, but that code base is not as friendly to newcomers.

How can I INSERT 1 million entries to my MySQL DB?

I want to test the speed of my SQL queries (update queries) with a real "load" on my DB. I'm relatively new to DBs, I am doing more complex queries than I have before, and I'm getting scared by people talking about performance like "30 seconds for 3000 records to be updated", etc. So I want to have a concrete experiment showing what my performance will be in production.
To achieve this, I want to add 10k, 100k, 1M, 10M records to my DB and then run my query.
My issue is: how can I do this? I have a "name" primary key field that must be unique, be <= 15 characters, and contain alphanumeric entries. The other fields I want to be the same for all created entries (e.g. a "foo" field I want to start at 10000).
If there's a way to do this and get approximately 1M entries (i.e. could be name collisions) that's fine. I'm just looking for a benchmarking dataset.
If there's a better way to benchmark my query, I'm all ears. I'm planning to simply execute and see how long the query says it takes.
Edit: It's worth noting that this is for a server and has nothing to do with "The Web", so I don't have access to PHP. I'm seeing some PHP scripts to populate; is there perhaps a way to have a Perl script write out all these queries and then suck them into the command-line mysql tools?
I'm not sure of how to use just MySQL to accomplish this, but if you have access to PHP, then use this:
<?php
$start = time();
$interval = 10000000; // 10M
$con = mysql_connect( 'server', 'user', 'pass' );
mysql_select_db( 'database' );
for ( $i = 0; $i < $interval; $i++ )
{
    mysql_query( 'INSERT INTO TABLE (fields) VALUES (values)', $con );
}
$endt = time();
$diff = ( $endt - $start );
print( "{$interval} queries took " . date( 'g:i:s', $diff ) . " to execute." );
?>
If you want to optimize queries, you should look into the EXPLAIN statement of MySQL.
To populate your database I would suggest you write your own little PHP script or check out this one:
http://www.generatedata.com
Regarding your edit:
You could generate a big text file with Perl and then use the MySQL CLI to load the file into the table; for more info please see:
http://dev.mysql.com/doc/refman/5.0/en/loading-tables.html
You just want to prepopulate your database so that you have something to run your queries against, and you are not benchmarking the initial insertion process?
In that case, just generate your input data as a tab-delimited file and use mysqlimport to populate your database.
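Since you mentioned Perl: a minimal sketch of that approach, writing a tab-delimited file and bulk-loading it (the database name, file path, and column layout are assumptions for illustration; mysqlimport derives the table name from the file's basename):

#!/usr/bin/perl
use strict;
use warnings;

my $rows = 1_000_000;
open my $fh, '>', '/tmp/mytable.txt' or die "Cannot open file: $!";
for my $i (1 .. $rows) {
    my $name = sprintf 'name%011d', $i;   # unique, alphanumeric, exactly 15 chars
    my $foo  = 10000;                     # same value for every row
    print {$fh} "$name\t$foo\n";
}
close $fh;

Then load it in one shot with:

mysqlimport --local database_name /tmp/mytable.txt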