libxml split text nodes at spaces - html

I am using libxml's HTML parser to build a DOM tree from HTML documents. libxml returns the text content of each node as a single string, but I need to split each text node at spaces and create one node per word. So far I haven't found a libxml option for this, so I wrote my own CPU-expensive logic to split text nodes. Below is the part of the recursive method that works.
void parse(xmlNodePtr cur, El*& parent) {
    if (!cur) {
        return;
    }
    string tagName = (const char*) cur->name;
    string content = node_text(cur); // function defined below
    Element* el = new Element(tagName, content);
    parent->childs.push_back(el);
    size_t pos;
    string text;
    cur = cur->children;
    while (cur != NULL) {
        if (xmlNodeIsText(cur) && (pos = node_text_find(cur, text, " ")) != string::npos) {
            string first = text.substr(0, pos);
            string second = text.substr(pos + 1);
            El *el1 = new Element("text", first);
            el->childs.push_back(el1);
            El *el2 = new Element("text", " ");
            el->childs.push_back(el2);
            xmlNodeSetContent(cur, BAD_CAST second.c_str());
            continue;
        }
        parse(cur, el);
        cur = cur->next;
    }
}
string node_text(xmlNodePtr cur) {
    string content;
    if (xmlNodeIsText(cur)) {
        xmlChar *buf = xmlNodeGetContent(cur);
        if (buf) {
            content = (const char*) buf;
            xmlFree(buf); // xmlNodeGetContent returns a copy that the caller must free
        }
    }
    return content;
}
size_t node_text_find(xmlNodePtr cur, string& text, string what){
    text = node_text(cur);
    return text.find_first_of(what);
}
The problem with the above code is that it didn't work for some UTF strings, such as Chinese text, and it also adds time to the overall parsing process.
Can anyone suggest a better way of doing this? Thank you in advance!

I don't have a complete answer, but I did see you doing explicit casts of xmlChar to char. That is a bad sign and probably why it doesn't work on Unicode.
If you're dealing with Unicode, which xmlChar probably is, you need to use Unicode-aware text processing libraries, not plain std::string.
You actually have two choices: find a library that processes UTF-8, or convert the UTF-8 into wchar_t (wide characters). If you convert to wide characters, you can use wstring and its functions to process Unicode.
The question "libxml2 xmlChar * to std::wstring" looks like a useful answer.
As for speed, do my eyes deceive me, or are you splitting on one space and creating a new element which you then split again? This is not the way to performance. I think it would go better if you removed the text node, split out all of the words, and added the new nodes as you go.
The slowdown is most likely in the repeated creation, copying and destruction of objects. Work to minimize that. For example, if Element had a constructor that accepted a begin/end iterator pair, or a (start, length) pair, that would be more efficient than creating a substring (copy!), creating an Element (copy!) and then destroying the substrings.
The repeated calling of xmlNodeSetContent with the (probably large) second half of the text string is giving you O(n^2) performance. Not good.
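A minimal sketch of the single-pass split suggested above, assuming the node text has already been copied into a std::string. The names Span and split_at_spaces are hypothetical and not part of libxml. It records (start, length) pairs instead of building substrings, and it searches for the ASCII space byte only. In UTF-8 every byte of a multi-byte character has its high bit set, so that search cannot land inside a character. It will not split on other space characters such as the ideographic space U+3000, which may matter for Chinese text.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One token inside the original text: a word or a single space.
struct Span {
    std::size_t start;
    std::size_t length;
};

// Walk the text once and record every word and every space as a Span.
// Nothing is copied; callers use text.substr(span.start, span.length)
// (or a pointer plus length) only when they actually need the token.
std::vector<Span> split_at_spaces(const std::string& text) {
    std::vector<Span> tokens;
    std::size_t start = 0;
    while (start < text.size()) {
        const std::size_t pos = text.find(' ', start);
        if (pos == std::string::npos) {
            tokens.push_back(Span{start, text.size() - start});  // final word
            break;
        }
        if (pos > start) {
            tokens.push_back(Span{start, pos - start});          // word before the space
        }
        tokens.push_back(Span{pos, 1});                          // the space itself
        start = pos + 1;
    }
    return tokens;
}
```

Unlike the loop in the question, this sketch emits no empty word when the text starts with a space or contains consecutive spaces, and the text node's content is never rewritten.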

Related

Parsing a String That's Kind of JSON

I have a set of strings that are JSON-ish but not JSON-compliant at all. They are also kind of CSV, but the values themselves sometimes contain commas.
The strings look like this:
ATTRIBUTE: Value of this attribute, ATTRIBUTE2: Another value, but this one has a comma in it, ATTRIBUTE3:, another value...
The only two patterns I can see that would mostly work are that the attribute names are in caps and are followed by a : and a space. After the first attribute, the pattern is: comma, name in caps, :, space.
The data is stored in Redshift, so I was going to see if I can use regex to resolve this, but my regex knowledge is limited. Where would I start?
If not, I'll resort to Python hacking.
What you're describing would be something like:
^([A-Z\d]+?): (.*?), ([A-Z\d]+?): (.*?), ([A-Z\d]+?): (.*)$
Note that this pattern assumes your third attribute value doesn't really start with a comma, and that your attribute names can contain digits.
If we take this apart:
[A-Z\d] Capital letters and digits
+?: As few of them as needed, up to the first :
 (.*?),  A space, then as few characters as needed, up to a comma and a space
^ and $ The beginning and the end of the string, respectively
The rest is a repetition of that pattern.
The ( ) only mark your capture groups; in this case they don't directly affect the match.
Here's a working example
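Separately from the linked example, here is a rough C++ sketch of a variation on the same idea that does not fix the number of attributes in advance. It uses std::regex only to locate attribute names (start of string or a comma, optional spaces, capitals and digits, then a colon) and takes each value as the text up to the next name. The function names parse_attributes and trim are hypothetical, and the heuristic breaks if a value itself contains something like ", NAME:".

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Remove leading and trailing spaces.
static std::string trim(const std::string& s) {
    const std::size_t first = s.find_first_not_of(' ');
    if (first == std::string::npos) return "";
    const std::size_t last = s.find_last_not_of(' ');
    return s.substr(first, last - first + 1);
}

// Return (name, value) pairs in the order they appear in the line.
std::vector<std::pair<std::string, std::string>>
parse_attributes(const std::string& line) {
    // Group 1 captures the attribute name without the trailing colon.
    static const std::regex name_re("(?:^|,) *([A-Z][A-Z0-9]*):");
    std::vector<std::pair<std::string, std::string>> result;
    const std::sregex_iterator end;
    for (std::sregex_iterator it(line.begin(), line.end(), name_re); it != end; ++it) {
        std::sregex_iterator next = it;
        ++next;
        // The value runs from the end of this name to the start of the next one.
        const std::size_t value_begin =
            static_cast<std::size_t>(it->position(0) + it->length(0));
        const std::size_t value_end =
            (next == end) ? line.size() : static_cast<std::size_t>(next->position(0));
        result.emplace_back((*it)[1].str(),
                            trim(line.substr(value_begin, value_end - value_begin)));
    }
    return result;
}
```

With the sample line from the question this yields three pairs, and the third value keeps its leading comma (", another value"), so that case does not need special handling.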
Often regex is not the right tool to use when it seems like it is.
Read this thoughtful post for details: https://softwareengineering.stackexchange.com/questions/223634/what-is-meant-by-now-you-have-two-problems
When a simpler scheme will do, use it! Here is one scheme that would successfully parse the structure as long as colons only occur between attributes and values, and not in them:
Code
static void Main(string[] args)
{
    string data = "ATTRIBUTE: Value of this attribute,ATTRIBUTE2: Another value, but this one has a comma in it,ATTRIBUTE3:, another value,value1,ATTRIBUTE4:end of file";
    Console.WriteLine();
    Console.WriteLine("As an String");
    Console.WriteLine();
    Console.WriteLine(data);
    string[] arr = data.Split(new[] { ":" }, StringSplitOptions.None);
    Dictionary<string, string> attributeNameToValue = new Dictionary<string, string>();
    Console.WriteLine();
    Console.WriteLine("As an Array Split on ':'");
    Console.WriteLine();
    Console.WriteLine("{\"" + String.Join("\",\"", arr) + "\"}");
    string currentAttribute = null;
    string currentValue = null;
    for (int i = 0; i < arr.Length; i++)
    {
        if (i == 0)
        {
            // The first element only has the first attribute name
            currentAttribute = arr[i].Trim();
        }
        else if (i == arr.Length - 1)
        {
            // The last element only has the final value
            attributeNameToValue[currentAttribute] = arr[i].Trim();
        }
        else
        {
            int indexOfLastComma = arr[i].LastIndexOf(",");
            currentValue = arr[i].Substring(0, indexOfLastComma).Trim();
            string nextAttribute = arr[i].Substring(indexOfLastComma + 1).Trim();
            attributeNameToValue[currentAttribute] = currentValue;
            currentAttribute = nextAttribute;
        }
    }
    Console.WriteLine();
    Console.WriteLine("As a Dictionary");
    Console.WriteLine();
    foreach (string key in attributeNameToValue.Keys)
    {
        Console.WriteLine(key + " : " + attributeNameToValue[key]);
    }
}
Output:
As an String
ATTRIBUTE: Value of this attribute,ATTRIBUTE2: Another value, but this one has a comma in it,ATTRIBUTE3:, another value,value1,ATTRIBUTE4:end of file
As an Array Split on ':'
{"ATTRIBUTE"," Value of this attribute,ATTRIBUTE2"," Another value, but this one has a comma in it,ATTRIBUTE3",", another value,value1,ATTRIBUTE4","end of file"}
As a Dictionary
ATTRIBUTE : Value of this attribute
ATTRIBUTE2 : Another value, but this one has a comma in it
ATTRIBUTE3 : , another value,value1
ATTRIBUTE4 : end of file

Using strcmp() in my cgi code, for an html webpage, is causing a server error

I am making an HTML webpage that uses CGI to access a table in a MySQL database. I input a .csv file containing info on my class schedule, and the page displays it in the usual schedule table.
My problem is that I can't seem to use strcmp in my parsing CGI, as it causes a server error. Here is an excerpt of my code where I use strcmp.
void parse2(char *queu)
{
    //---------------------------------------------------------------
    char *saveptr[1024];
    char *subtoken;
    char *Subject;
    char *Day;
    char *Start;
    char *End;
    char *Room;
    char *Teacher;
    int check = 1;
    //---------------------------------------------------------------
    subtoken = strtok_r(queu, ",", saveptr);
    check = strcmp(subtoken, "\0");
    printf("%d<br>", check);
    if(check == 0){
        printf("Error!");
    } else {
        Subject = subtoken;
        Day = strtok_r(NULL, ",", saveptr);
        Start = strtok_r(NULL, ",", saveptr);
        End = strtok_r(NULL, ",", saveptr);
        Room = strtok_r(NULL, ",", saveptr);
        Teacher = strtok_r(NULL, ",", saveptr);
        printf("%s\n<br/>%s\n<br/>%s\n<br/>%s\n<br/>%s\n<br/>%s\n", Subject, Day, Start, End, Room, Teacher);
        //inputsql(Subject, Day, Start, End, Room, Teacher);
    }
    //---------------------------------------------------------------
}
Note that I have tested this code, and it works fine when I don't call strcmp().
I am using strcmp() to prevent a line of unwanted characters, which appears after the info when it is retrieved using the POST method, from being entered into my database.
As you can see from the above code, I used strtok_r() to parse the line of info. Since the line of unwanted characters does not contain a comma (which is my delimiter), it should return a NULL value, correct?
Can anyone help me out? I welcome suggestions for a different way of solving the problem I chose to solve with strcmp().
I think you should be checking subtoken == NULL, not strcmp(subtoken, "\0") == 0.
"\0" is a string containing a NUL byte, then another NUL (the terminator), so the standard library's string functions will just see an empty string. That's different to a NULL pointer (i.e. a pointer with value zero).
From STRTOK(3):
Each call to strtok() returns a pointer to a null-terminated string
containing the next token. This string does not include the
delimiting byte. If no more tokens are found, strtok() returns NULL.
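A minimal sketch of the NULL check, assuming a POSIX system where strtok_r is available and a hypothetical helper name split_fields. Note that a line with no comma at all does not make the first call return NULL: the whole line comes back as a single token, and it is the second call that returns NULL. Checking every token before using it covers both cases.

```cpp
#include <cstddef>
#include <string.h>   // strtok_r is POSIX, not part of standard C or C++

// Split `line` in place on commas into exactly `count` fields.
// Returns false as soon as a field is missing, so no NULL pointer
// ever reaches strcmp() or printf("%s").
// Note: strtok_r overwrites the commas in `line` with '\0'.
bool split_fields(char* line, char* fields[], std::size_t count) {
    char* saveptr = nullptr;  // strtok_r takes the address of a single char*
    char* token = strtok_r(line, ",", &saveptr);
    for (std::size_t i = 0; i < count; ++i) {
        if (token == nullptr) {
            return false;     // fewer than `count` tokens on this line
        }
        fields[i] = token;
        token = strtok_r(nullptr, ",", &saveptr);
    }
    return true;
}
```

A caller would pass a writable buffer and an array of six pointers for Subject, Day, Start, End, Room and Teacher, and skip the line when the function returns false. Be aware that strtok_r treats consecutive commas as one delimiter, so empty fields are silently skipped.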

Extracting integers from a query string

I am creating a program that can make mysql transactions through C and html.
I have this query string
query = -id=103&-id=101&-id=102&-act=Delete
Extracting "Delete" by sscanf isn't that hard, but I need help extracting the integers and putting them in an array of int id[]. The number of -id entries can vary depending on how many checkboxes were checked in the html form.
I've been searching for hours but haven't found any applicable solution; or I just did not understand them. Any ideas?
Thanks
You can use strstr and atoi to extract the numbers in a loop, like this:
char *query = "-id=103&-id=101&-id=102&-act=Delete";
char *ptr = strstr(query, "-id=");
if (ptr) {
    ptr += 4;
    int n = atoi(ptr);
    printf("%d\n", n);
    for (;;) {
        ptr = strstr(ptr, "&-id=");
        if (!ptr) break;
        ptr += 5;
        int n = atoi(ptr);
        printf("%d\n", n);
    }
}
Demo on ideone.
You want to use strtok, or a better alternative, to tokenize this string with & and = as delimiters.
Take a look at cplusplus.com for more information and an example.
This is the output you would get from strtok
Output:
Splitting string "- This, a sample string." into tokens:
This
a
sample
string
Once you figure out how to split them, the next hurdle is to convert the numbers from strings to ints. For this, look at atoi or its safer, more robust cousin strtol.
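As a rough sketch of the strtol route, the hypothetical function extract_ids below walks the query one parameter at a time and collects every -id value into a vector, so the number of checked boxes does not need to be known in advance. Unlike atoi, strtol reports where parsing stopped, which makes it possible to reject values such as "-id=12abc". Overflow detection via errno is left out for brevity.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <vector>

// Collect the numeric value of every "-id=<number>" parameter in `query`.
std::vector<long> extract_ids(const char* query) {
    std::vector<long> ids;
    const char* const key = "-id=";
    const std::size_t key_len = std::strlen(key);
    const char* p = query;
    while (*p != '\0') {
        // p always points at the start of a parameter here.
        if (std::strncmp(p, key, key_len) == 0) {
            const char* digits = p + key_len;
            char* end = nullptr;
            const long value = std::strtol(digits, &end, 10);
            // Accept only if something was parsed and it fills the whole value.
            if (end != digits && (*end == '&' || *end == '\0')) {
                ids.push_back(value);
            }
        }
        // Move to the parameter after the next '&', if there is one.
        const char* amp = std::strchr(p, '&');
        if (amp == nullptr) break;
        p = amp + 1;
    }
    return ids;
}
```

If a plain int id[] array is required, the vector's contents can be copied out once its size is known.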
Most likely I would write a small lexical scanner to tackle the task. Meaning, I would analyze the string one character at a time, according to a regular expression representing the set of possible inputs.

Should I avoid magic strings as much as possible?

I have the following piece of code:
internal static string GetNetBiosDomainFromMember(string memberName)
{
    int indexOf = memberName.IndexOf("DC=", StringComparison.InvariantCultureIgnoreCase);
    indexOf += "DC=".Length;
    string domaninName = memberName.Substring(indexOf, memberName.Length - indexOf);
    if (domaninName.Contains(","))
    {
        domaninName = domaninName.Split(new[] { "," }, StringSplitOptions.None)[0];
    }
    return domaninName;
}
I am doing some parsing for AD, so I have strings like "DC=", "objectCategory=", "LDAP://", ",", "." and so on.
I find the above code more readable than the code below. (You may find the opposite; let me know.)
private const string DcString = "DC=";
private const string CommaString = ",";
internal static string GetNetBiosDomainFromMember(string memberName)
{
    int indexOf = memberName.IndexOf(DcString, StringComparison.InvariantCultureIgnoreCase);
    indexOf += DcString.Length;
    string domaninName = memberName.Substring(indexOf, memberName.Length - indexOf);
    if (domaninName.Contains(CommaString))
    {
        domaninName = domaninName.Split(new[] { CommaString }, StringSplitOptions.None)[0];
    }
    return domaninName;
}
I may even have both "DC" and "DC=", so I would have to think up names for these variables or split them in two :(. Hence my question:
Should I avoid magic strings as much as possible?
UPDATED.
Some conclusions:
There are ways to avoid using strings at all, which might be better. Options include static classes, enumerations, numeric constants, IoC containers and even reflection.
A constant string helps ensure you don't have any typos across all references to the string.
Constant strings for punctuation don't have any global semantics. It is more readable to use them as they are, e.g. ",". A constant may be worth considering if the value might change in the future, such as "," becoming "." (a constant may help with that refactoring, although modern tools such as ReSharper can do it without a constant or variable).
If you only use a string once, you do not need to make it a constant. Consider, however, that a constant can be documented and shows up in generated documentation (such as Javadoc). This may be important for non-trivial string values.
I would certainly make constants for the actual names like "DC" and "objectCategory", but not for the punctuation. The point of this is to make sure you don't have any typos and such and that you can easily find all of the references for the places that use that magic string. The punctuation is not really part of that.
Just to be clear, I'm assuming the magic strings are things that you have to deal with, that you don't have the option of making them a number defined by a constant. As in the comment to your question, that's always preferable if that's possible. But sometimes you must use a string if you have to interface with some other system that requires it.
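To illustrate that split between named constants and inline punctuation, here is a small sketch in C++ rather than C#. The names ldap_names, kDomainComponent and first_domain_component are hypothetical. Deriving "DC=" from "DC" also addresses the worry about naming both variants. Unlike the C# code in the question, the search here is case-sensitive, and a missing "DC=" returns an empty string.

```cpp
#include <cstddef>
#include <string>

namespace ldap_names {
// Names that carry meaning in the directory syntax are defined once,
// so a typo cannot creep into one of several call sites.
const std::string kDomainComponent = "DC";
const std::string kDomainComponentPrefix = kDomainComponent + "=";
}  // namespace ldap_names

// Return the value of the first "DC=" component, or "" if there is none.
std::string first_domain_component(const std::string& member_name) {
    const std::string& prefix = ldap_names::kDomainComponentPrefix;
    const std::size_t start = member_name.find(prefix);
    if (start == std::string::npos) {
        return "";
    }
    const std::size_t value_begin = start + prefix.size();
    // Punctuation stays inline: ',' is clearer than a constant named Comma.
    const std::size_t comma = member_name.find(',', value_begin);
    const std::size_t length =
        (comma == std::string::npos) ? std::string::npos : comma - value_begin;
    return member_name.substr(value_begin, length);
}
```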

How to convert data to CSV or HTML format on iOS?

In my iOS application I need to export some data in CSV or HTML format. How can I do this?
RegexKitLite comes with an example of how to read a csv file into an NSArray of NSArrays, and to go in the reverse direction is pretty trivial.
It'd be something like this (warning: code typed in browser):
NSArray * data = ...; //An NSArray of NSArrays of NSStrings
NSMutableString * csv = [NSMutableString string];
for (NSArray * line in data) {
    NSMutableArray * formattedLine = [NSMutableArray array];
    for (NSString * field in line) {
        BOOL shouldQuote = NO;
        NSRange r = [field rangeOfString:@","];
        //fields that contain a , must be quoted
        if (r.location != NSNotFound) {
            shouldQuote = YES;
        }
        r = [field rangeOfString:@"\""];
        //fields that contain a " must have them escaped to "" and be quoted
        if (r.location != NSNotFound) {
            field = [field stringByReplacingOccurrencesOfString:@"\"" withString:@"\"\""];
            shouldQuote = YES;
        }
        if (shouldQuote == YES) {
            [formattedLine addObject:[NSString stringWithFormat:@"\"%@\"", field]];
        } else {
            [formattedLine addObject:field];
        }
    }
    NSString * combinedLine = [formattedLine componentsJoinedByString:@","];
    [csv appendFormat:@"%@\n", combinedLine];
}
[csv writeToFile:@"/path/to/file.csv" atomically:NO];
The general solution is to use stringWithFormat: to format each row. Presumably, you're writing this to a file or socket, in which case you would write a data representation of each string (see dataUsingEncoding:) to the file handle as you create it.
If you're formatting a lot of rows, you may want to use initWithFormat: and explicit release messages, in order to avoid running out of memory by piling up too many string objects in the autorelease pool.
And always, always, always remember to escape the values correctly before passing them to the formatting method.
Escaping (along with unescaping) is a really good thing to write unit tests for. Write a function to CSV-format a single row, and have test cases that compare its result to correct output. If you have a CSV parser on hand, or you're going to need one, or you just want to be really sure your escaping is correct, write unit tests for the parsing and unescaping as well as the escaping and formatting.
If you can start with a single record containing any combination of CSV-special and/or SQL-special characters, format it, parse the formatted string, and end up with a record equal to the one you started with, you know your code is good.
(All of the above applies equally to CSV and to HTML. If possible, you might consider using XHTML, so that you can use XML validation tools and parsers, including NSXMLParser.)
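A compact sketch of such a row formatter, written in C++ so it can be unit-tested in isolation. The function names csv_escape and csv_row are hypothetical. It assumes the common CSV convention of quoting any field that contains a comma, a double quote or a line break, and doubling embedded double quotes.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Quote a field only when it needs it; double any embedded quotes.
std::string csv_escape(const std::string& field) {
    if (field.find_first_of(",\"\r\n") == std::string::npos) {
        return field;  // nothing special: emit as-is
    }
    std::string out = "\"";
    for (char c : field) {
        if (c == '"') {
            out += '"';  // a literal quote becomes two quotes
        }
        out += c;
    }
    out += '"';
    return out;
}

// Format one record as a single CSV line terminated by '\n'.
std::string csv_row(const std::vector<std::string>& fields) {
    std::string row;
    for (std::size_t i = 0; i < fields.size(); ++i) {
        if (i > 0) {
            row += ',';
        }
        row += csv_escape(fields[i]);
    }
    row += '\n';
    return row;
}
```

Test cases for a function like this would cover a plain field, a field with a comma, a field with quotes, and a field with an embedded line break.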
CSV - comma separated values.
I usually just iterate over the data structures in my application and output one set of values per line, with the values in each set separated by commas.
struct person
{
    string first_name;
    string second_name;
};
person tony = {"tony", "momo"};
person john = {"john", "smith"};
would look like
tony, momo
john, smith