I have a set of strings that are JSON-ish but not valid JSON. They are also somewhat CSV-like, except that the values themselves sometimes contain commas.
The strings look like this:
ATTRIBUTE: Value of this attribute, ATTRIBUTE2: Another value, but this one has a comma in it, ATTRIBUTE3:, another value...
The only two patterns I can see that would mostly work are that the attribute names are in caps and followed by a : and space. After the first attribute, the pattern is , name-in-caps : space.
The data is stored in Redshift, so I was going to see if I can use regex to resolve this, but my regex knowledge is limited. Where would I start?
If not, I'll resort to python hacking.
What you're describing would be something like:
^([A-Z\d]+?): (.*?), ([A-Z\d]+?): (.*?), ([A-Z\d]+?): (.*)$
Note that this pattern assumes your third attribute value doesn't really start with a comma, and that your attribute names can contain digits.
If we take this apart:
[A-Z\d] Capital letters and digits
+?: As many as needed, up to the first :
(.*?), a space, then as many characters as needed up to a comma and a space
^ and $ The beginning and the end of the string, respectively
The rest is a repetition of that pattern.
The ( ) only mark your capture groups; in this case, they don't directly affect what matches.
Here's a working example
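The pattern above handles exactly three attributes. Since the question mentions falling back to Python, here is a minimal sketch of a generalization that matches one `NAME: value` pair at a time and stops each value at the next `, NAME:` or at the end of the string. It assumes attribute names contain only capital letters, digits and underscores, and that no value contains a comma followed by such a name and a colon. The helper name `parse_attributes` is made up for illustration.

```python
import re

# One attribute: a name in capitals/digits, a colon, an optional space, then a
# value that runs lazily until the next ", NAME:" or the end of the string.
PAIR = re.compile(r"([A-Z][A-Z\d_]*):\s?(.*?)(?=,\s*[A-Z][A-Z\d_]*:|$)")

def parse_attributes(text):
    """Return a dict of attribute name -> value (hypothetical helper)."""
    return {m.group(1): m.group(2).strip() for m in PAIR.finditer(text)}

sample = ("ATTRIBUTE: Value of this attribute, "
          "ATTRIBUTE2: Another value, but this one has a comma in it, "
          "ATTRIBUTE3:, another value")
print(parse_attributes(sample))
```

Unlike the anchored pattern, this version also accepts a value that begins with a comma, as the third attribute in the question does.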
Often regex is not the right tool to use when it seems like it is.
Read this thoughtful post for details: https://softwareengineering.stackexchange.com/questions/223634/what-is-meant-by-now-you-have-two-problems
When a simpler scheme will do, use it! Here is one scheme that would successfully parse the structure as long as colons only occur between attributes and values, and not in them:
Code
static void Main(string[] args)
{
    string data = "ATTRIBUTE: Value of this attribute,ATTRIBUTE2: Another value, but this one has a comma in it,ATTRIBUTE3:, another value,value1,ATTRIBUTE4:end of file";
    Console.WriteLine();
    Console.WriteLine("As an String");
    Console.WriteLine();
    Console.WriteLine(data);
    string[] arr = data.Split(new[] { ":" }, StringSplitOptions.None);
    Dictionary<string, string> attributeNameToValue = new Dictionary<string, string>();
    Console.WriteLine();
    Console.WriteLine("As an Array Split on ':'");
    Console.WriteLine();
    Console.WriteLine("{\"" + String.Join("\",\"", arr) + "\"}");
    string currentAttribute = null;
    string currentValue = null;
    for (int i = 0; i < arr.Length; i++)
    {
        if (i == 0)
        {
            // The first element only has the first attribute name
            currentAttribute = arr[i].Trim();
        }
        else if (i == arr.Length - 1)
        {
            // The last element only has the final value
            attributeNameToValue[currentAttribute] = arr[i].Trim();
        }
        else
        {
            // Everything before the last comma is the current value;
            // everything after it is the next attribute name
            int indexOfLastComma = arr[i].LastIndexOf(",");
            currentValue = arr[i].Substring(0, indexOfLastComma).Trim();
            string nextAttribute = arr[i].Substring(indexOfLastComma + 1).Trim();
            attributeNameToValue[currentAttribute] = currentValue;
            currentAttribute = nextAttribute;
        }
    }
    Console.WriteLine();
    Console.WriteLine("As a Dictionary");
    Console.WriteLine();
    foreach (string key in attributeNameToValue.Keys)
    {
        Console.WriteLine(key + " : " + attributeNameToValue[key]);
    }
}
Output:
As an String
ATTRIBUTE: Value of this attribute,ATTRIBUTE2: Another value, but this one has a comma in it,ATTRIBUTE3:, another value,value1,ATTRIBUTE4:end of file
As an Array Split on ':'
{"ATTRIBUTE"," Value of this attribute,ATTRIBUTE2"," Another value, but this one has a comma in it,ATTRIBUTE3",", another value,value1,ATTRIBUTE4","end of file"}
As a Dictionary
ATTRIBUTE : Value of this attribute
ATTRIBUTE2 : Another value, but this one has a comma in it
ATTRIBUTE3 : , another value,value1
ATTRIBUTE4 : end of file
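Since the question mentions falling back to Python, the same colon-splitting scheme might look roughly like this there. It is only a sketch under the same assumption (colons occur only between attribute names and values), and `parse_by_colon` is a made-up name.

```python
def parse_by_colon(data):
    """Split on ':' and peel the next attribute name off the end of each chunk."""
    chunks = data.split(":")
    result = {}
    current = chunks[0].strip()  # the first chunk holds only the first name
    for chunk in chunks[1:-1]:
        # Text before the last comma is the value; text after it is the next name.
        value, _, next_name = chunk.rpartition(",")
        result[current] = value.strip()
        current = next_name.strip()
    if len(chunks) > 1:
        result[current] = chunks[-1].strip()  # the last chunk holds only a value
    return result

data = ("ATTRIBUTE: Value of this attribute,ATTRIBUTE2: Another value, "
        "but this one has a comma in it,ATTRIBUTE3:, another value,value1,"
        "ATTRIBUTE4:end of file")
for name, value in parse_by_colon(data).items():
    print(name, ":", value)
```

Using `rpartition` means a middle chunk without any comma yields an empty value instead of raising an error.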
I am trying to remove [" from the beginning of the string and "] from the end of the string by using the REPLACE function in a derived column, but it is giving an error.
I have used the formula below
REPLACE(columnanme,"["","")
to remove the [" at the beginning of the string, but it is not working.
Can someone help me with this?
Note: Data is in table and datatype is NTEXT
Regards,
Khatija
I believe you just need to escape the " character, like so:
\"
REPLACE(columnanme,"[\"","")
Otherwise it sees the " in the middle as the closing quote and you have an invalid statement.
I am trying to remove [" from beginning of the string and "] end of the string
Supposing that we reliably have brackets and quotes wrapping the data, the simplest approach would be to use substring. This would be easier to do in SQL:
UPDATE myTable SET columnname = SUBSTRING(columnname, 3, LEN(columnname) -4)
WHERE columnname LIKE '["%"]'
If you want to do this in SSIS, you'll need to use a script component transformation to avoid data loss when converting the value to a string. Select the column you want to work with and set the usage type to ReadWrite:
In the script, I have added a method GetNewString, which converts the blob to a string and strips the unwanted characters. You can also use Replace or Regex.Replace if that makes more sense.
In the Input0_ProcessInputRow method, we convert the column's data, reset the blob and then add the new value:
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    var input = GetNewString(Row.columname);
    Row.columname.ResetBlobData();
    Row.columname.AddBlobData(System.Text.Encoding.Unicode.GetBytes(input));
}

public string GetNewString(Microsoft.SqlServer.Dts.Pipeline.BlobColumn blobColumn)
{
    if (blobColumn.IsNull)
        return string.Empty;
    // Decode the blob as UTF-16, then drop the leading [" and the trailing "]
    var blobData = blobColumn.GetBlobData(0, (int)blobColumn.Length);
    var stringData = System.Text.Encoding.Unicode.GetString(blobData);
    stringData = stringData.Substring(2, stringData.Length - 4);
    return stringData;
}
I receive a Unicode text flat file in which one column is a single fixed-length value, and the other contains a list of values delimited by a vertical pipe '|'. The length of the second column and the number of delimited values it contains vary greatly. In some cases the column will be up to 50000 characters wide and could contain a thousand or more delimited values.
Input file Example:
[ObjectGUID]; [member]
{BD3481AF8-2CDG-42E2-BA93-73952AFB41F3}; CN=rGlynn SrechrshiresonIII,OU=Users,OU=PRV,OU=LOL,DC=ent,DC=keke,DC=cqb,DC=corp
{AC365A4F8-2CDG-42E2-BA33-73933AFB41F3}; CN=reeghler Johnson,OU=Users,OU=PRV,OU=LOL,DC=ent,DC=keke,DC=cqb,DC=corp|CN=rCoefler Cellins,OU=Users,OU=PRV,OU=LOL,DC=ent,DC=keke,DC=cqb,DC=corp|CN=rDasije M. Delmogeroo,OU=Users,OU=PRV,OU=LOL,DC=ent,DC=keke,DC=cqb,DC=corp|CN=rCurry T. Carrollton,OU=Users,OU=PRV,OU=LOL,DC=ent,DC=keke,DC=cqb,DC=corp|CN=yMica Macintosh,OU=Users,OU=PRV,OU=LOL,DC=ent,DC=keke,DC=cqb,DC=corp
My idea is to perform a Split operation on this column and create a new row for each value. I am attempting to use a script component to perform the split.
The width of the delimited column can easily exceed the 4000-character limit of DT_WSTR, so I chose NTEXT as the data type. This presents a problem because the .Split method I am familiar with requires a string. I am attempting to convert the NTEXT to a string in the script component.
Here is my code:
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    var stringMember = Row.member.ToString();
    var groupMembers = stringMember.Split('|');
    foreach (var groupMember in groupMembers)
    {
        this.Output0Buffer.AddRow();
        this.Output0Buffer.objectGUID = Row.objectGUID;
        this.Output0Buffer.member = groupMember;
    }
}
The output I am trying to get would be this:
[ObjectGUID] [member]
{BD3481AF8-2CDG-42E2-BA93-73952AFB41F3} CN=rGlynn SrechrshiresonIII,OU=Users,OU=PRV,OU=LOL,DC=ent,DC=keke,DC=cqb,DC=corp
{AC365A4F8-2CDG-42E2-BA33-73933AFB41F3} CN=reeghler Johnson,OU=Users,OU=PRV,OU=LOL,DC=ent,DC=keke,DC=cqb,DC=corp
{AC365A4F8-2CDG-42E2-BA33-73933AFB41F3} CN=rCoefler Cellins,OU=Users,OU=PRV,OU=LOL,DC=ent,DC=keke,DC=cqb,DC=corp
{AC365A4F8-2CDG-42E2-BA33-73933AFB41F3} CN=rDasije M. Delmogeroo,OU=Users,OU=PRV,OU=LOL,DC=ent,DC=keke,DC=cqb,DC=corp
{AC365A4F8-2CDG-42E2-BA33-73933AFB41F3} CN=rCurry T. Carrollton,OU=Users,OU=PRV,OU=LOL,DC=ent,DC=keke,DC=cqb,DC=corp
{AC365A4F8-2CDG-42E2-BA33-73933AFB41F3} CN=yMica Macintosh,OU=Users,OU=PRV,OU=LOL,DC=ent,DC=keke,DC=cqb,DC=corp
But what I am in fact getting is this:
[ObjectGUID] [member]
{BD3481AF8-2CDG-42E2-BA93-73952AFB41F3} Microsoft.SqlServer.Dts.Pipeline.BlobColumn
{AC365A4F8-2CDG-42E2-BA33-73933AFB41F3} Microsoft.SqlServer.Dts.Pipeline.BlobColumn
What might I be doing wrong?
The following code worked:
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // Read the raw bytes of the NTEXT blob and decode them as UTF-16
    var blobLength = Convert.ToInt32(Row.member.Length);
    var blobData = Row.member.GetBlobData(0, blobLength);
    var stringData = System.Text.Encoding.Unicode.GetString(blobData);
    var groupMembers = stringData.Split('|');
    foreach (var groupMember in groupMembers)
    {
        // Emit one output row per delimited member
        this.Output0Buffer.AddRow();
        this.Output0Buffer.CN = Row.CN;
        this.Output0Buffer.ObjectGUID = Row.ObjectGUID;
        this.Output0Buffer.member = groupMember;
    }
}
I was trying to perform an implicit conversion as I would in PowerShell, but was actually just passing some object metadata to the string output. This method properly splits my members and builds a complete row.
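For readers more at home outside .NET, the decode-then-split step can be sketched in Python. This is only an illustration of the idea, not SSIS code: `explode_members` and the sample values are made up, and it relies on .NET's `Encoding.Unicode` being UTF-16 little-endian.

```python
def explode_members(object_guid, member_blob):
    """Decode a UTF-16LE byte blob and yield one (guid, member) row per value."""
    text = member_blob.decode("utf-16-le")
    for member in text.split("|"):
        yield (object_guid, member)

# Hypothetical sample data standing in for one input row.
blob = "CN=A,OU=Users|CN=B,OU=Users".encode("utf-16-le")
rows = list(explode_members("{GUID-1}", blob))
for row in rows:
    print(row)
```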
OK, if you don't understand the title, let me explain.
Let's say I have a variable called "money" in a class "wallet". Normally I would just do this to get the value:
trace(wallet.money);
But what if I had 10 different money variables, like money_1, money_2, etc.?
Can I build a string whose initial value is "wallet.money_" and then just append the number at the end? The function would look like this:
public function getmoney(num:Number):Number
{
    var word:String = "wallet.money_" + num.toString();
    return // this would be where i return the value of the money variable.
}
Is this possible or not?
You can reference it like this:
wallet["money_" + i]
Here i is an int. Use int rather than Number for index values.
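The same idea, building a member name at run time and looking it up by that name, exists in most dynamic languages. As a rough analogy only (this is Python, not ActionScript, and the `Wallet` class and its values are invented for the example):

```python
class Wallet:
    # Hypothetical stand-ins for the money_1, money_2, ... variables.
    money_1 = 10
    money_2 = 25

def get_money(wallet, num: int):
    # Build the attribute name at run time, then look it up by name.
    return getattr(wallet, f"money_{num}")

print(get_money(Wallet, 2))  # 25
```

If the variables are always numbered like this, a single array or list indexed by number is usually simpler than computed names.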
I am using libxml's HTML parser to create a DOM tree of HTML documents. libxml gives the text content of each node as one monolithic string (node), but my requirement is to further split each text node at spaces and create one node per word. So far I haven't found any option for this in libxml, so I wrote some CPU-expensive logic to split text nodes. Below is the part of the recursive method that works.
void parse(xmlNodePtr cur, El*& parent) {
    if (!cur) {
        return;
    }
    string tagName = (const char*) cur->name;
    string content = node_text(cur); // function defined below
    Element* el = new Element(tagName, content);
    parent->childs.push_back(el);
    size_t pos;
    string text;
    cur = cur->children;
    while (cur != NULL) {
        if (xmlNodeIsText(cur) && (pos = node_text_find(cur, text, " ")) != string::npos) {
            string first = text.substr(0, pos);
            string second = text.substr(pos + 1);
            El *el1 = new Element("text", first);
            el->childs.push_back(el1);
            El *el2 = new Element("text", " ");
            el->childs.push_back(el2);
            xmlNodeSetContent(cur, BAD_CAST second.c_str());
            continue;
        }
        parse(cur, el);
        cur = cur->next;
    }
}

string node_text(xmlNodePtr cur) {
    string content;
    if (xmlNodeIsText(cur)) {
        xmlChar *buf = xmlNodeGetContent(cur);
        content = (const char*) buf;
    }
    return content;
}

size_t node_text_find(xmlNodePtr cur, string& text, string what){
    text = node_text(cur);
    return text.find_first_of(what);
}
The problem with the above code is that it doesn't work for some UTF strings, such as Chinese text, and it also adds time to the overall parsing process.
Can anyone suggest a better way of doing this? Thank you in advance!
I don't have a complete answer but I did see you doing explicit casts of xmlChar to char. That is a bad sign and probably why it doesn't work on Unicode.
If you're dealing with Unicode, which xmlChar probably is, you need to be using Unicode text processing libraries, not std::string.
You actually have two choices: find a library that processes UTF-8 directly, or convert the UTF-8 into wide characters (wchar_t). If you convert to wide characters, you can use std::wstring and its functions to process Unicode.
libxml2 xmlChar * to std::wstring looks like a useful answer.
As for speed, do my eyes deceive me or are you splitting on one space and creating a new element which you then split again? This is not the way to performance. I think it would go better if you remove the text node, split all of the words out and add the new nodes as you go.
The slowdown is most likely in the repeated creation, copying and destruction of objects. Work to minimize that. For example, if Element had a constructor form that accepted a begin/end iterator pair, or a start, length pair, that would be more efficient than creating a substring (copy!) and creating an Element (copy!) and then destroying the substrings.
The repeated calling of xmlNodeSetContent with the (probably large) second half of the text string is giving you O(n^2) performance. Not good.
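To illustrate the "split all of the words out in one go" suggestion, here is a small sketch in Python rather than C++. It works on an already decoded Unicode string and produces every word and space token in a single pass, instead of repeatedly re-splitting the remainder. `split_words` is a made-up helper name.

```python
import re

def split_words(text):
    """Split once into words and single-space tokens, preserving their order."""
    # The capture group makes re.split keep the separators in the result;
    # empty strings (from leading or doubled spaces) are dropped.
    return [tok for tok in re.split(r"( )", text) if tok != ""]

print(split_words("你好 世界 ok"))
```

Each token would then become one new node, so the original text is scanned only once.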
In my iOS application I need to export some data into CSV or HTML format. How can I do this?
RegexKitLite comes with an example of how to read a csv file into an NSArray of NSArrays, and to go in the reverse direction is pretty trivial.
It'd be something like this (warning: code typed in browser):
NSArray * data = ...; //An NSArray of NSArrays of NSStrings
NSMutableString * csv = [NSMutableString string];
for (NSArray * line in data) {
    NSMutableArray * formattedLine = [NSMutableArray array];
    for (NSString * field in line) {
        //work on a local copy; the loop variable itself cannot be reassigned under ARC
        NSString * value = field;
        BOOL shouldQuote = NO;
        NSRange r = [value rangeOfString:@","];
        //fields that contain a , must be quoted
        if (r.location != NSNotFound) {
            shouldQuote = YES;
        }
        r = [value rangeOfString:@"\""];
        //fields that contain a " must have them escaped to "" and be quoted
        if (r.location != NSNotFound) {
            value = [value stringByReplacingOccurrencesOfString:@"\"" withString:@"\"\""];
            shouldQuote = YES;
        }
        if (shouldQuote == YES) {
            [formattedLine addObject:[NSString stringWithFormat:@"\"%@\"", value]];
        } else {
            [formattedLine addObject:value];
        }
    }
    NSString * combinedLine = [formattedLine componentsJoinedByString:@","];
    [csv appendFormat:@"%@\n", combinedLine];
}
[csv writeToFile:@"/path/to/file.csv" atomically:NO encoding:NSUTF8StringEncoding error:NULL];
The general solution is to use stringWithFormat: to format each row. Presumably, you're writing this to a file or socket, in which case you would write a data representation of each string (see dataUsingEncoding:) to the file handle as you create it.
If you're formatting a lot of rows, you may want to use initWithFormat: and explicit release messages, in order to avoid running out of memory by piling up too many string objects in the autorelease pool.
And always, always, always remember to escape the values correctly before passing them to the formatting method.
Escaping (along with unescaping) is a really good thing to write unit tests for. Write a function to CSV-format a single row, and have test cases that compare its result to correct output. If you have a CSV parser on hand, or you're going to need one, or you just want to be really sure your escaping is correct, write unit tests for the parsing and unescaping as well as the escaping and formatting.
If you can start with a single record containing any combination of CSV-special and/or SQL-special characters, format it, parse the formatted string, and end up with a record equal to the one you started with, you know your code is good.
(All of the above applies equally to CSV and to HTML. If possible, you might consider using XHTML, so that you can use XML validation tools and parsers, including NSXMLParser.)
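The round-trip test described above can be sketched in a few lines. This example uses Python's standard csv module for both directions, purely to show the shape of such a test; in the iOS app the formatter and parser under test would be your own code. `format_row`, `parse_row` and the sample record are made up for illustration.

```python
import csv
import io

def format_row(fields):
    """CSV-format a single row; csv.writer handles the quoting and escaping."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerow(fields)
    return buf.getvalue()

def parse_row(line):
    """Parse one CSV-formatted row back into a list of fields."""
    return next(csv.reader(io.StringIO(line, newline="")))

# A record mixing CSV-special and SQL-special characters.
record = ["plain", "has, comma", 'has "quotes"', "O'Brien; DROP TABLE x", "two\nlines"]

# Format, parse, and compare with the record we started from.
assert parse_row(format_row(record)) == record
print(format_row(["a,b", 'c"d']), end="")
```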
CSV stands for comma-separated values.
I usually just iterate over the data structures in my application and output one set of values per line, with the values within each set separated by commas.
struct person
{
    string first_name;
    string second_name;
};
person tony = {"tony", "momo"};
person john = {"john", "smith"};
would look like
tony, momo
john, smith