MySQL C API with Splint: Freeing fields and rows

I'm trying to use Splint with the MySQL C API and have run into some additional problems relating to freeing memory. In all the examples I can find about using the C API, the only freeing function that is called is mysql_free_result; rows and fields are never freed:
while ((row = mysql_fetch_row(result)))
{
    for (int i = 0; i < num_fields; i++)
    {
        if (i == 0)
        {
            while ((field = mysql_fetch_field(result)))
            {
                // Bla
            }
        }
    }
}
mysql_free_result(result);
mysql_close(con);
Splint will of course report errors saying that row and field are memory leaks. Are they? How do I resolve this? Should I manually free the row and field structs?
Edit: OK, since mysql_fetch_row returns NULL when there are no more rows, I assume it also frees all memory from the former row on each loop. But how would I tell Splint this?
Edit 2: Here's the implementation of mysql_fetch_row (version 5.5). No memory is allocated; the function just points you to the next row. So what's needed is a Splint annotation for the function that tells Splint that no memory is allocated, only shared.
MYSQL_ROW STDCALL
mysql_fetch_row(MYSQL_RES *res)
{
  DBUG_ENTER("mysql_fetch_row");
  if (!res->data)
  { /* Unbufferred fetch */
    if (!res->eof)
    {
      MYSQL *mysql= res->handle;
      if (mysql->status != MYSQL_STATUS_USE_RESULT)
      {
        set_mysql_error(mysql,
                        res->unbuffered_fetch_cancelled ?
                        CR_FETCH_CANCELED : CR_COMMANDS_OUT_OF_SYNC,
                        unknown_sqlstate);
      }
      else if (!(read_one_row(mysql, res->field_count, res->row, res->lengths)))
      {
        res->row_count++;
        DBUG_RETURN(res->current_row=res->row);
      }
      DBUG_PRINT("info",("end of data"));
      res->eof=1;
      mysql->status=MYSQL_STATUS_READY;
      /*
        Reset only if owner points to us: there is a chance that somebody
        started new query after mysql_stmt_close():
      */
      if (mysql->unbuffered_fetch_owner == &res->unbuffered_fetch_cancelled)
        mysql->unbuffered_fetch_owner= 0;
      /* Don't clear handle in mysql_free_result */
      res->handle=0;
    }
    DBUG_RETURN((MYSQL_ROW) NULL);
  }
  {
    MYSQL_ROW tmp;
    if (!res->data_cursor)
    {
      DBUG_PRINT("info",("end of data"));
      DBUG_RETURN(res->current_row=(MYSQL_ROW) NULL);
    }
    tmp = res->data_cursor->data;
    res->data_cursor = res->data_cursor->next;
    DBUG_RETURN(res->current_row=tmp);
  }
}

The correct annotation for sharing read-only storage is /*@observer@*/. For sharing writeable storage, /*@exposed@*/ can be used.
/*@observer@*/ MYSQL_ROW STDCALL mysql_fetch_row(MYSQL_RES *result);
More details are in the Splint manual.
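One way to apply this without touching mysql.h is a small annotated wrapper header seen only by Splint. This is a minimal sketch, assuming Splint's usual S_SPLINT_S predefine; it marks both rows and fields as observer storage that may also be NULL:
/* splint_mysql.h - hypothetical wrapper header, included only for Splint runs */
#ifdef S_SPLINT_S
/* Rows and fields point into storage owned by the MYSQL_RES object, so the
   caller must not free them; both calls may also return NULL. */
/*@observer@*/ /*@null@*/ MYSQL_ROW    mysql_fetch_row(MYSQL_RES *result);
/*@observer@*/ /*@null@*/ MYSQL_FIELD *mysql_fetch_field(MYSQL_RES *result);
#endif
With declarations like these in effect, the loop in the question should no longer report row and field as leaked, since Splint then knows the storage is released by mysql_free_result.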

Related

How to create out-of-tree QEMU devices?

Two possible mechanisms come to mind:
IPC like the existing QMP and QAPI
QEMU loads a shared library plugin that contains the model
Required capabilities (all of them are of course possible through the C API, but not necessarily through IPC APIs):
inject interrupts
register callbacks for register access
modify main memory
Why I want this:
use QEMU as a submodule and leave its source untouched
additional advantages only present for IPC methods:
write the models in any language I want
use a non-GPL license for my device
I'm aware of in-tree devices, as explained at How to add a new device in QEMU source code?, which are the traditional way of doing things.
What I've found so far:
interrupts: could only find NMI generation with the nmi monitor command
IO ports: IO possible with i and o monitor commands, so I'm fine there
main memory:
the ideal solution would be to map memory to host directly, but that seems hard:
http://kvm.vger.kernel.narkive.com/rto1dDqn/sharing-variables-memory-between-host-and-guest
https://www.linux-kvm.org/images/e/e8/0.11.Nahanni-CamMacdonell.pdf
http://www.fp7-save.eu/papers/SCALCOM2016.pdf
memory read is possible through the x and xp monitor commands
could not find how to write to memory with monitor commands. But I think the GDB API supports it, so it should not be too hard to implement.
The closest working piece of code I could find was https://github.com/texane/vpcie, which serializes PCI on both sides and sends it through QEMU's TCP API. But this is less efficient and more intrusive, as it requires extra setup on both guest and host.
This creates an out-of-tree PCI device; it just displays the device in lspci.
It should make PCI driver implementation faster, since it acts as a module.
Can we extend this to have functionality similar to QEMU's edu-pci device?
https://github.com/alokprasad/pci-hacking/blob/master/ksrc/virtual_pcinet/virtual_pci.c
#include <linux/init.h>
#include <linux/module.h>
#include <linux/sysfs.h>
#include <linux/fs.h>
#include <linux/kobject.h>
#include <linux/device.h>
#include <linux/proc_fs.h>
#include <linux/types.h>
#include <linux/pci.h>
#include <linux/version.h>
#include <linux/kernel.h>

#define PCI_VENDOR_ID_XTREME      0x15b3
#define PCI_DEVICE_ID_XTREME_VNIC 0x1450

static struct pci_bus *vbus;
static struct pci_sysdata *sysdata;

static DEFINE_PCI_DEVICE_TABLE(vpci_dev_table) = {
    {PCI_DEVICE(PCI_VENDOR_ID_XTREME, PCI_DEVICE_ID_XTREME_VNIC)},
    {0}
};
MODULE_DEVICE_TABLE(pci, vpci_dev_table);

int vpci_read(struct pci_bus *bus, unsigned int devfn, int where,
              int size, u32 *val)
{
    switch (where) {
    case PCI_VENDOR_ID:
        *val = PCI_VENDOR_ID_XTREME | PCI_DEVICE_ID_XTREME_VNIC << 16;
        /* our id */
        break;
    case PCI_COMMAND:
        *val = 0;
        break;
    case PCI_HEADER_TYPE:
        *val = PCI_HEADER_TYPE_NORMAL;
        break;
    case PCI_STATUS:
        *val = 0;
        break;
    case PCI_CLASS_REVISION:
        *val = (4 << 24) | (0 << 16) | 1;
        /* network class, ethernet controller, revision 1 */ /* 2 or 4 */
        break;
    case PCI_INTERRUPT_PIN:
        *val = 0;
        break;
    case PCI_SUBSYSTEM_VENDOR_ID:
        *val = 0;
        break;
    case PCI_SUBSYSTEM_ID:
        *val = 0;
        break;
    default:
        *val = 0;
        /* sensible default */
    }
    return 0;
}

int vpci_write(struct pci_bus *bus, unsigned int devfn, int where,
               int size, u32 val)
{
    switch (where) {
    case PCI_BASE_ADDRESS_0:
    case PCI_BASE_ADDRESS_1:
    case PCI_BASE_ADDRESS_2:
    case PCI_BASE_ADDRESS_3:
    case PCI_BASE_ADDRESS_4:
    case PCI_BASE_ADDRESS_5:
        break;
    }
    return 0;
}

struct pci_ops vpci_ops = {
    .read  = vpci_read,
    .write = vpci_write
};

void vpci_remove_vnic()
{
    struct pci_dev *pcidev = NULL;

    if (vbus == NULL)
        return;
    pci_remove_bus_device(pcidev);
    pci_dev_put(pcidev);
}
EXPORT_SYMBOL(vpci_remove_vnic);

void vpci_vdev_remove(struct pci_dev *dev)
{
}

static struct pci_driver vpci_vdev_driver = {
    .name     = "Xtreme-Virtual-NIC1",
    .id_table = vpci_dev_table,
    .remove   = vpci_vdev_remove
};

int vpci_bus_init(void)
{
    struct pci_dev *pcidev = NULL;

    sysdata = kzalloc(sizeof(void *), GFP_KERNEL);
    vbus = pci_scan_bus_parented(NULL, 2, &vpci_ops, sysdata);
    //vbus = pci_create_root_bus(NULL, i, &vpci_ops, sysdata, NULL);
    //if (vbus != NULL)
    //    break;
    memset(sysdata, 0, sizeof(void *));
    if (vbus == NULL) {
        kfree(sysdata);
        return -EINVAL;
    }
    if (pci_register_driver(&vpci_vdev_driver) < 0) {
        pci_remove_bus(vbus);
        vbus = NULL;
        return -EINVAL;
    }
    pcidev = pci_scan_single_device(vbus, 0);
    if (pcidev == NULL)
        return 0;
    else
        pci_dev_get(pcidev);
    pci_bus_add_devices(vbus);
    return 0;
}

void vpci_bus_remove(void)
{
    if (vbus) {
        pci_unregister_driver(&vpci_vdev_driver);
        device_unregister(vbus->bridge);
        pci_remove_bus(vbus);
        kfree(sysdata);
        vbus = NULL;
    }
}

static int __init pci_init(void)
{
    printk("module loaded");
    vpci_bus_init();
    return 0;
}

static void __exit pci_exit(void)
{
    printk(KERN_ALERT "unregister PCI Device\n");
    pci_unregister_driver(&vpci_vdev_driver);
}

module_init(pci_init);
module_exit(pci_exit);
MODULE_LICENSE("GPL");
There is at least one fork of QEMU I'm aware of that offers shared library plugins for QEMU... but it's a fork of QEMU 4.0.
https://github.com/cromulencellc/qemu-shoggoth
It is possible to build out-of-tree plugins with this fork, though it's not documented.
On Nov 11 2019 Peter Maydell, a major QEMU contributor, commented on another Stack Overflow question that:
Device plugins are specifically off the menu, because upstream does not want to provide a nice easy mechanism for people to use to have out-of-tree non-GPL/closed-source devices.
So it seems that the QEMU developers were opposed to this idea as of that point in time. It is worth learning about the QEMU plugin system though, which might come in handy for related applications in any case: How to count the number of guest instructions QEMU executed from the beginning to the end of a run?
This is a shame. Imagine if the Linux kernel didn't have a kernel module interface! I suggest that QEMU expose this interface, just without making it stable, so that it doesn't impose a burden on developers, with the upside that those who merge upstream changes won't have such painful rebases.

Aggregate results from CUDA threads

I am solving the minimal dominant set problem on CUDA. Every thread finds some local candidate result and I need to find the best one. I am using __device__ variables for the global result (dev_bestConfig and dev_bestValue).
I need to do something like this:
__device__ configType dev_bestConfig = 0;
__device__ int dev_bestValue = INT_MAX;

__device__ void findMinimalDominantSet(int count, const int *matrix, Lock &lock)
{
    // here is some algorithm that finds local bestValue and bestConfig

    // set device variables
    if (bestValue < dev_bestValue)
    {
        dev_bestValue = bestValue;
        dev_bestConfig = bestConfig;
    }
}
I know that this does not work because multiple threads access the memory at the same time, so I use this critical section:
// set device variables
bool isSet = false;
do
{
    if (isSet = atomicCAS(lock.mutex, 0, 1) == 0)
    {
        // critical section goes here
        if (bestValue < dev_bestValue)
        {
            dev_bestValue = bestValue;
            dev_bestConfig = bestConfig;
        }
    }
    if (isSet)
    {
        *lock.mutex = 0;
    }
} while (!isSet);
This actually works as expected, but it is really slow. For example, without this critical section it takes 0.1 seconds, and with this critical section it takes 1.8 seconds.
What can I do differently to make it faster?
I actually avoided any critical sections and locking in the end. I saved the local results to an array and then searched for the best one. The search can be done sequentially or by parallel reduction, as in the sketch below.
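A minimal sketch of that idea, assuming one candidate per thread (the names findCandidates, d_values and d_configs are illustrative, not from my original code): every thread writes into its own slot, and the host then scans the arrays once, which can later be swapped for a parallel min-reduction.
#include <limits.h>

// Each thread stores its local candidate in a per-thread slot; no locking needed.
__global__ void findCandidates(const int *matrix, int count,
                               int *d_values, configType *d_configs)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    int bestValue = INT_MAX;
    configType bestConfig = 0;
    // ... per-thread search over its share of the configurations ...

    d_values[tid]  = bestValue;   // one slot per thread, so no race
    d_configs[tid] = bestConfig;
}

// Host side: copy both arrays back and scan them once (n = total thread count).
// This O(n) pass is cheap compared to the kernel and is easy to replace later.
int pickBest(const int *h_values, const configType *h_configs, int n,
             configType *bestConfigOut)
{
    int bestValue = INT_MAX;
    for (int i = 0; i < n; ++i)
    {
        if (h_values[i] < bestValue)
        {
            bestValue = h_values[i];
            *bestConfigOut = h_configs[i];
        }
    }
    return bestValue;
}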

How to select a GPU with CUDA?

I have a computer with 2 GPUs; I wrote a CUDA C program and I need to tell it somehow that I want to run it on just 1 of the 2 graphics cards. What is the command I need to type and how should I use it? I believe it is somehow related to cudaSetDevice, but I can't really find out how to use it.
It should be pretty clear from the documentation of cudaSetDevice, but let me provide the following code snippet.
bool IsGpuAvailable()
{
    int devicesCount;
    cudaGetDeviceCount(&devicesCount);
    for (int deviceIndex = 0; deviceIndex < devicesCount; ++deviceIndex)
    {
        cudaDeviceProp deviceProperties;
        cudaGetDeviceProperties(&deviceProperties, deviceIndex);
        if (deviceProperties.major >= 2
            && deviceProperties.minor >= 0)
        {
            cudaSetDevice(deviceIndex);
            return true;
        }
    }
    return false;
}
This is how I iterated through all available GPUs (cudaGetDeviceCount), looking for the first one with Compute Capability of at least 2.0. If such a device was found, I used cudaSetDevice so that all CUDA computations were executed on that particular device. Without calling cudaSetDevice, your CUDA app executes on the first GPU, i.e. the one with deviceIndex == 0, but which particular GPU that is depends on which GPU sits in which PCIe slot.
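If you want to double-check which device is actually active at some point, a quick sanity check (not part of the original snippet) is:
int currentDevice = -1;
cudaGetDevice(&currentDevice);                        // index of the device in use
cudaDeviceProp properties;
cudaGetDeviceProperties(&properties, currentDevice);
printf("Running on device %d: %s\n", currentDevice, properties.name);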
EDIT:
After clarifying your question in the comments, it seems to me that choosing the device based on its name should suit you. If you are unsure about your actual GPU names, run this code, which prints the names of all your GPUs to the console:
int devicesCount;
cudaGetDeviceCount(&devicesCount);
for (int deviceIndex = 0; deviceIndex < devicesCount; ++deviceIndex)
{
    cudaDeviceProp deviceProperties;
    cudaGetDeviceProperties(&deviceProperties, deviceIndex);
    cout << deviceProperties.name << endl;
}
After that, choose the name of the GPU that you want to use for computations, let's say it is "GTX XYZ". Call the following method from your main method; thanks to it, all CUDA kernels will be executed on the device named "GTX XYZ". You should also check the return value - true if a device with that name is found, false otherwise:
bool SetGPU()
{
    int devicesCount;
    cudaGetDeviceCount(&devicesCount);
    string desiredDeviceName = "GTX XYZ";
    for (int deviceIndex = 0; deviceIndex < devicesCount; ++deviceIndex)
    {
        cudaDeviceProp deviceProperties;
        cudaGetDeviceProperties(&deviceProperties, deviceIndex);
        if (deviceProperties.name == desiredDeviceName)
        {
            cudaSetDevice(deviceIndex);
            return true;
        }
    }
    return false;
}
Of course you have to change the value of desiredDeviceName variable to desired value.
Searching more carefully on the internet, I found these lines of code that select the GPU with the most multiprocessors among all the devices installed in the PC.
int num_devices, device;
cudaGetDeviceCount(&num_devices);
if (num_devices > 1) {
    int max_multiprocessors = 0, max_device = 0;
    for (device = 0; device < num_devices; device++) {
        cudaDeviceProp properties;
        cudaGetDeviceProperties(&properties, device);
        if (max_multiprocessors < properties.multiProcessorCount) {
            max_multiprocessors = properties.multiProcessorCount;
            max_device = device;
        }
    }
    cudaSetDevice(max_device);
}

Connect to mysql in a c script?

I'm new to using the G-WAN server (link) and, for that matter, to programming in C. I wanted to know the easiest way to use MySQL in a C script for the G-WAN server.
I've experimented with dbi.c as used here (the project page can be found here), but I also found that there is a C API for MySQL itself, which you can find here.
Anyone have experience using either or both? What are some of the pros/cons? Are there other libraries that make connecting to MySQL easy for a noob like myself?
Any help appreciated.
Thanks!
[EDIT]
Also, is libdbi thread-safe? It appears not to be.
[EDIT 2]
It appears that the MySQL library itself is the easy way to go, unless you think you might be switching database types later, as libdbi can use different drivers, which is nice for abstraction.
Relating to G-WAN: for me, if I had any "mysql code" in the main function of a handler, it appeared to be unsafe and caused random, intermittent errors; but if I put the "mysql code" in the init function and put any data I needed in a KV store accessed off one of the global pointers, the random errors went away completely. (I was using libdbi; I assume it would be the same for the MySQL API.)
Hope this helps
I always prefer using the native C API...
#pragma link "/usr/lib/libmysqlclient.so"
#include "gwan.h"
#include <mysql/mysql.h>

int
main (int argc, char **argv)
{
    MYSQL_RES *result;
    MYSQL_ROW row;
    MYSQL conn, *conn_h;

    conn_h = mysql_init (&conn);
    if (!conn_h)
    {
        return 200;
    }
    /* ctx->usr and ctx->psw are application-specific credentials */
    if (!mysql_real_connect (conn_h, "localhost", ctx->usr, ctx->psw, NULL, 0, NULL, 0))
    {
        mysql_close (conn_h);
        return 200;
    }
    mysql_select_db (conn_h, "");
    char *query = "";
    if (mysql_query (conn_h, query))
    {
        mysql_close (conn_h);
        return 200;
    }
    result = mysql_store_result (conn_h);
    if (!result)
    {
        mysql_close (conn_h);
        return 200;
    }
    if (mysql_num_rows (result) == 0)
    {
        mysql_free_result (result); /* free the (empty) result before returning */
        mysql_close (conn_h);
        return 200;
    }
    while ((row = mysql_fetch_row (result)))
    {
        /* do something with row[i] */
    }
    mysql_free_result (result);
    mysql_close (conn_h);
    return 200; // Ok
}
Keep in mind that you need to initialize the MySQL library if you plan to spawn threads (this code is not thread-safe), for example as sketched below.
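A minimal sketch of that initialization, using the MySQL C API's documented library/thread setup calls (the surrounding worker function is illustrative):
#include <mysql/mysql.h>

/* Call once, from the main thread, before any other MySQL call. */
int setup_mysql_library(void)
{
    return mysql_library_init(0, NULL, NULL);   /* returns 0 on success */
}

/* Every thread that touches the MySQL API should bracket its work like this. */
void *worker(void *arg)
{
    mysql_thread_init();              /* per-thread initialization */
    /* ... mysql_init / mysql_real_connect / queries / mysql_close ... */
    mysql_thread_end();               /* release per-thread state */
    return NULL;
}

/* Call once, after all worker threads have finished. */
void teardown_mysql_library(void)
{
    mysql_library_end();
}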
Hope this helps you in some way.

CUDA block synchronization differences between GTS 250 and Fermi devices

So I've been working on a program in which I'm creating a hash table in global memory. The code is completely functional (albeit slower) on a GTS 250, which is a Compute 1.1 device. However, on a Compute 2.0 device (C2050 or C2070) the hash table is corrupt (the data is incorrect and pointers are sometimes wrong).
Basically the code works fine when only one block is utilized (on both devices). However, when 2 or more blocks are used, it works only on the GTS 250 and not on any Fermi devices.
I understand that the warp scheduling and memory architecture of the two platforms are different, and I am taking that into account when developing the code. From my understanding, using __threadfence() should make sure any global writes are committed and visible to other blocks; however, judging from the corrupt hash table, it appears that they are not.
I've also posted the problem on the NVIDIA CUDA developer forum and it can be found here.
Relevant code below:
__device__ void lock(int *mutex) {
    while (atomicCAS(mutex, 0, 1) != 0);
}

__device__ void unlock(int *mutex) {
    atomicExch(mutex, 0);
}

__device__ void add_to_global_hash_table(unsigned int key, unsigned int count, unsigned int sum, unsigned int sumSquared, Table table, int *globalHashLocks, int *globalFreeLock, int *globalFirstFree)
{
    // Find entry if it exists
    unsigned int hashValue = hash(key, table.count);

    lock(&globalHashLocks[hashValue]);

    int bucketHead = table.entries[hashValue];
    int currentLocation = bucketHead;
    bool found = false;
    Entry currentEntry;

    while (currentLocation != -1 && !found) {
        currentEntry = table.pool[currentLocation];
        if (currentEntry.data.x == key) {
            found = true;
        } else {
            currentLocation = currentEntry.next;
        }
    }

    if (currentLocation == -1) {
        // If entry does not exist, create entry
        lock(globalFreeLock);
        int newLocation = (*globalFirstFree)++;
        __threadfence();
        unlock(globalFreeLock);

        Entry newEntry;
        newEntry.data.x = key;
        newEntry.data.y = count;
        newEntry.data.z = sum;
        newEntry.data.w = sumSquared;
        newEntry.next = bucketHead;

        // Add entry to table
        table.pool[newLocation] = newEntry;
        table.entries[hashValue] = newLocation;
    } else {
        currentEntry.data.y += count;
        currentEntry.data.z += sum;
        currentEntry.data.w += sumSquared;
        table.pool[currentLocation] = currentEntry;
    }
    __threadfence();
    unlock(&globalHashLocks[hashValue]);
}
As pointed out by LSChien in this post, the issue is L1 cache coherency. While __threadfence() guarantees that shared and global memory writes are visible to other threads, since it is not atomic, thread x in block 1 may still read a stale cached value until thread y in block 0 has executed up to the threadfence instruction. LSChien instead suggested a hack in his post: using atomicCAS() to force the thread to read from global memory instead of a cached value. The proper way to do this is to declare the memory as volatile, which requires that every write to that memory be made visible to all other threads in the grid immediately.
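A minimal sketch of that change, assuming it is the bucket-head array that is being read stale (the cast and names are illustrative); reads through a volatile-qualified pointer are not served from a stale L1 line:
// Hypothetical adjustment inside add_to_global_hash_table(): access the
// entries array through a volatile-qualified pointer so each load goes to
// L2/global memory instead of a possibly stale L1 cache line on Fermi.
volatile int *entries = (volatile int *)table.entries;

lock(&globalHashLocks[hashValue]);
int bucketHead = entries[hashValue];   // always re-read from global memory
// ... rest of the lookup/insert logic unchanged ...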
__threadfence guarantees that the calling thread's writes to global memory are visible to all other threads in the device before any writes it makes after the fence (__threadfence_block gives the same guarantee only within the current block). That is not the same as "the write operation on global memory is complete"! Think of the caching on each multiprocessor.