What data format is this? - html

I was checking one share trading site's AJAX response and below is what it showed up in Firebug Response tab of XHR section. Can anyone explain me what format is this and how is it parsed ?
<ST=tat>
<SI=0>
<TB=txtSearch>
<560v=Tata Motors Ltdv=TATMOT>
<566v=Tata Steel Ltdv=TATSTE>
<3199v=Ashram Online.com Ltdv=ASHONL>
<4866v=Kreon Finnancial Services Ltdv=KREFIN>
<552v=Tata Chemicals Ltdv=TATCHE>
<554v=Tata Power Company Ltdv=TATPOW>
<2986v=Tata Metaliks Ltdv=TATMET>
<300v=Tata Sponge Iron Ltdv=TATSPO>
<121v=Tata Coffee Ltdv=TATCOF>
<2295v=Tata Communications Ltdv=TATCOM>
<0v=Time In Milli-Secondsv=0>

I think what we are dealing with here is some proprietary format, likely an Eldricht SGML Horror of some sort.
Banking in general has all sorts of Eldricht horrors running about.
On a related note, this is very much not XML.
Edit:
A quick analysis* indicates that this is a format consisting of a series of statements bracketed by <>; with the parts of the statements separated by = or v=. = seems to indicate a parameter to a control statement, indicated by a two-letter code. (<ST=tat>), while v= seems to indicate an assignment or coupling of some kind (short for "value"?), or perhaps just a field separator.
<ST appears to be short for "search term"; <TB appears to be short for "(source) table". The meaning of <SI eludes me. It is possible that <TB terminates the metadata section, but it's equally possible that the metadata section has a fixed number of terms.
As nothing refers to the number of fields in each statement in the data section, and they are all of the same length (3 fields), it is likely that the number of fields is fixed, but it might derive from the value of <TB, or even <SI, in some way.
What is abundantly clear, however, is that this data is not intended for consumption by other applications than the one that supplies it.
*Caveat: Without a much larger sample it's impossible to tell if this analysis is valid.

It is not a commonly used "web format".
It is probably a proprietary format used by that site and will be parsed by their custom JavaScript.

Related

How to realize a context search based on synomyns?

Lets say an internet user searches for "trouble with gmail".
How can I return entries with "problem|problems|issues|issue|trouble|troubles with gmail|googlemail|google mail"?
I don't like to manually add these linkings between different keywords so the links between "issue <> problem <> trouble" and "gmail <> googlemail <> google mail" are completly unknown. They should be found in an automated process.
Approach to solve the problem
I provide a synonyms/thesaurus plattform like thesaurus.com, synonym.com, etc. or use an synomys database/api and use this user generated input for my queries on a third website.
But this won't cover all synonyms like the "gmail"-example.
Which other options do I have? Maybe something based on the given data and logged search phrases of the past?
You have to think of it ignoring the language.
When you show a baby the same thing using two words, he understand that those words are synonym. He might not have understood perfectly, but he will learn when this is repeated.
You type "problem with gmail".
Two choices:
Your search give results: you click on one item.
The system identify that this item was already clicked before when searching for "google mail bug". That's a match, and we will call it a "relative search".
Your search give poor results:
We will search in our history for a matching search:
We propose : "do you mean trouble with yahoo mail? yes/no". You click no, that's a "no match". And we might propose others suggestions like a list of known "relative search" or a list of might be related playing with both full text search in our history and levenshtein distance.
When a term is sufficiently scored to be considered as a "synonym", you can consider it is. Algorithm might be wrong, but in fact it depends on what you really expect.
If i search "sending a message is difficult with google", and "gmail issue", nothing is synonym, but search are relatively the same. This is more important to me than true synonyms.
And if you really want to get the synonym, i would do it in a second phase comparing words inside "relative searches" and would include a manual check.
I think google algorithm use synonym mainly to highlight search terms in page result, but not to do an actual search where they use the relative search terms, except in known situations, as the result for "gmail" and "google mail" are not the same.
But if you identify 10 relative searches for "gmail" which all contains "google mail", that will be a good start point to guess they are synonyms.
This is a bit long for a comment.
What you are looking for is called a "thesaurus" or "synonyms" list in the world of text searching. Apparently, there is a proposal for such functionality in MySQL. It is not yet implemented. (Here is a related question on Stack Overflow, although the link in the question doesn't seem to work.)
The work-around would be to modify queries before sending them to the database. That is, parse the query into words, then look up all the synonyms for those words, and reconstruct the query. This works better for the natural language searches than the boolean searches (which require more careful reconstruction).
Pseudo-code for getting the final word list with synonyms would be something like:
select #finalwords = concat_ws(' ', group_concat(synonyms separator ' ') )
from synonyms s
where find_in_set(s.baseword, #words) > 0;
Seems to me that you have two problems on your hands:
Lemmatisation, which breaks words down into their lemma, sometimes called the headword or root word. This is more difficult than Stemming, as it doesn't just chop suffixes off of words, but tries to find a true root, e.g. "are" => "be". This is something that is often done programatically, although it appears to be a complex task. Here is an online example of text being lemmatized: http://lemmatise.ijs.si/Services
Searching for synonymous lemmas. This is a very complex problem. One approach to this that I have heard of is modifying the lemmatisation engine to return more than one lemma for a given set of words, i.e. "problems" => "problem" and "issue", thereby allowing a more flexible set of results. However, this means that the synonymous lemmas must be provided to the lemmatisation engine from elsewhere. I truly have no idea how you would build a list of synonyms programatically.
So, you may consider a strategy whereby you lemmatise the text to be searched for, then pass each lemma out to your synonym finder (however that works) to get a final list of lemmas to perform your search with.
I think you have bitten off a very large problem for yourself.
If the system in question is a publicly accessible website, one 'out there' option is to ensure all content can be crawled by Google and then use a Google search on your own site, which should give you the synonym capability 'for free'. There would obviously be some vagaries in the results though and lag in getting match results for newly created content, depending upon how regularly the crawlers hit the site. Probably not suitable in your use case, but for some people, this may be sufficient.
Seeing your revised question, what about using a public API?
http://www.programmableweb.com/category/reference/apis?category=20066&keyword=synonym

SSRS (inconsistently) replaces special characters with HTML entity codes

Our UI has a few fields which are free-text boxes, generally for the user to added descriptions and comments about an item that other form elements do not cover. The users have entered a variety of things, sometimes including characters, and so far the MSSQL 2008 r2 database has handled it well.
Now, we have added SSRS reports to the application, and some users are finding when the report runs, that special characters are being replaced, by what the users call "garbledygook" by which I can identify as HTML entity codes. A couple specific examples:
As in the UI/DB: "... we need to have R&D evaluate ..."
On the report: "... we need to have R&D evaluate ..."
As in the UI/DB: "I suggest "rapid" utilization of ..."
On the report: "I suggest "rapid" utilization of of ..."
As in the UI/DB: "Updating the savings values.(carriage return)Also revised ..."
On the report: "Updating the savings values.
Also revised ..."
The trouble is with the emphasis on some users. Preliminary indications are that IE8 is among the offenders, but not all IE8 users have seen this, and none of us DEVs can replicate it in any of our environments.
So two questions really, what is the cause? And, what is the solution?
You could HTML decode it in the report. On the report properties references tab you have to add the assembly System.Web. Then you can use the expression:
=System.Web.HttpUtility.HTMLDecode(Fields!MyField.Value)
Bit of an old question, but you can set any given textbox to interpret HTML as styling by modifying the placeholder settings as here https://msdn.microsoft.com/en-us/library/dd207057.aspx
This includes ® character codes, <b></b> and <i></i> for some basics to work with.

When could a CSV records *not* have the same number of fields?

I am storing a series of events to a CSV file, each event type comes with a different set of data.
To illustrate, say I have two events (there will be many more):
Running, which has a data set containing speed and incline.
Sleeping, which has a data set containing snores.
There are two options to store this data in CSV records:
Option A
Storing each possible item of data in it's own field...
speed, incline, snores
therefore...
15mph, 20%, ,
, , 12
16mph, 20%, ,
14mph, 20%, ,
Option B
Storing each event in its own record...
event, value1...
therefore...
running, 15mph, 20%
sleeping, 12
running, 16mph, 20%
running, 14mph, 20%
Without a specific CSV specification, the consensus seems to be:
Each record "should" contain the same number of comma-separated fields.
Context
There are a number of events which each have a large & different set of data values.
CSV data is to be of use to other developers (I will/could/should/won't use either structure).
The 'other developers' to be toward the novice end of the spectrum and/or using resource limited systems. CSV is accessible.
The CSV format is being provided non-exclusively as feature not requirement. Although, if said application is providing a CSV file it should be provided in the correct manner from now on.
Question
Would it be valid – in this case - to go with Option B?
Thoughts
Option B maintains a level of human readability, which is an advantage say CSV is read by human not processor. Neither method is more complex to parse using a custom parser, but will Option B void the usefulness of a CSV format with other libraries, frameworks, applications et al. With Option A future changes/versions to the data set of an individual event may break the CSV structure (zombie , , to maintain forwards compatibility); whereas Option B will fail gracefully.
edit
This may be aimed at students and frameworks like OpenFrameworks, Plask, Proccessing et al. where CSV is easier to implement.
Any "other frameworks, libraries and applications" I've ever used all handle CSV parsing differently, so trying to conform to one or many of these standards might over-complicate your end result. My recommendation would be to keep it simple and use what works for your specific task. If human readbility is a requirement, then CSV in the form of Option B would work fine. Otherwise, you may want to consider JSON or XML.
As you say there is no "CSV Standard" with regard to contents. The real answer depend on what you are doing and why. You mention "other frameworks, libraries and applications". The one thing I've learnt is "Dont over engineer". i.e. Don't write reams of code today on the assumption that you will plug it into some other framework tomorrow.
I'd say option B is fine, unless you have specific requirements to use other apps etc.
< edit >
Having re-read your context, I'd probably pick one output format and use it, and forget about having multiple formats:
Having multiple output formats is a source of inconsistency (e.g. bug in one format but not another).
Having multiple formats means more code that needs to be
tested
documented
supported
< /edit >
Is there any reason you can't use XML? Yes, it's slightly more difficult to parse, at least for novices, but if so they probably need the practice. File size would be much greater, of course, but it's compressible.

Why shouldn't I use "Hungarian Notation"?

Locked. This question and its answers are locked because the question is off-topic but has historical significance. It is not currently accepting new answers or interactions.
I know what Hungarian refers to - giving information about a variable, parameter, or type as a prefix to its name. Everyone seems to be rabidly against it, even though in some cases it seems to be a good idea. If I feel that useful information is being imparted, why shouldn't I put it right there where it's available?
See also: Do people use the Hungarian naming conventions in the real world?
vUsing adjHungarian nnotation vmakes nreading ncode adjdifficult.
Most people use Hungarian notation in a wrong way and are getting wrong results.
Read this excellent article by Joel Spolsky: Making Wrong Code Look Wrong.
In short, Hungarian Notation where you prefix your variable names with their type (string) (Systems Hungarian) is bad because it's useless.
Hungarian Notation as it was intended by its author where you prefix the variable name with its kind (using Joel's example: safe string or unsafe string), so called Apps Hungarian has its uses and is still valuable.
Joel is wrong, and here is why.
That "application" information he's talking about should be encoded in the type system. You should not depend on flipping variable names to make sure you don't pass unsafe data to functions requiring safe data. You should make it a type error, so that it is impossible to do so. Any unsafe data should have a type that is marked unsafe, so that it simply cannot be passed to a safe function. To convert from unsafe to safe should require processing with some kind of a sanitize function.
A lot of the things that Joel talks of as "kinds" are not kinds; they are, in fact, types.
What most languages lack, however, is a type system that's expressive enough to enforce these kind of distinctions. For example, if C had a kind of "strong typedef" (where the typedef name had all the operations of the base type, but was not convertible to it) then a lot of these problems would go away. For example, if you could say, strong typedef std::string unsafe_string; to introduce a new type unsafe_string that could not be converted to a std::string (and so could participate in overload resolution etc. etc.) then we would not need silly prefixes.
So, the central claim that Hungarian is for things that are not types is wrong. It's being used for type information. Richer type information than the traditional C type information, certainly; it's type information that encodes some kind of semantic detail to indicate the purpose of the objects. But it's still type information, and the proper solution has always been to encode it into the type system. Encoding it into the type system is far and away the best way to obtain proper validation and enforcement of the rules. Variables names simply do not cut the mustard.
In other words, the aim should not be "make wrong code look wrong to the developer". It should be "make wrong code look wrong to the compiler".
I think it massively clutters up the source code.
It also doesn't gain you much in a strongly typed language. If you do any form of type mismatch tomfoolery, the compiler will tell you about it.
Hungarian notation only makes sense in languages without user-defined types. In a modern functional or OO-language, you would encode information about the "kind" of value into the datatype or class rather than into the variable name.
Several answers reference Joels article. Note however that his example is in VBScript, which didn't support user-defined classes (for a long time at least). In a language with user-defined types you would solve the same problem by creating a HtmlEncodedString-type and then let the Write method accept only that. In a statically typed language, the compiler will catch any encoding-errors, in a dynamically typed you would get a runtime exception - but in any case you are protected against writing unencoded strings. Hungarian notations just turns the programmer into a human type-checker, with is the kind of job that is typically better handled by software.
Joel distinguishes between "systems hungarian" and "apps hungarian", where "systems hungarian" encodes the built-in types like int, float and so on, and "apps hungarian" encodes "kinds", which is higher-level meta-info about variable beyound the machine type, In a OO or modern functional language you can create user-defined types, so there is no distinction between type and "kind" in this sense - both can be represented by the type system - and "apps" hungarian is just as redundant as "systems" hungarian.
So to answer your question: Systems hungarian would only be useful in a unsafe, weakly typed language where e.g. assigning a float value to an int variable will crash the system. Hungarian notation was specifically invented in the sixties for use in BCPL, a pretty low-level language which didn't do any type checking at all. I dont think any language in general use today have this problem, but the notation lived on as a kind of cargo cult programming.
Apps hungarian will make sense if you are working with a language without user defined types, like legacy VBScript or early versions of VB. Perhaps also early versions of Perl and PHP. Again, using it in a modern languge is pure cargo cult.
In any other language, hungarian is just ugly, redundant and fragile. It repeats information already known from the type system, and you should not repeat yourself. Use a descriptive name for the variable that describes the intent of this specific instance of the type. Use the type system to encode invariants and meta info about "kinds" or "classes" of variables - ie. types.
The general point of Joels article - to have wrong code look wrong - is a very good principle. However an even better protection against bugs is to - when at all possible - have wrong code to be detected automatically by the compiler.
I always use Hungarian notation for all my projects. I find it really helpful when I'm dealing with 100s of different identifier names.
For example, when I call a function requiring a string I can type 's' and hit control-space and my IDE will show me exactly the variable names prefixed with 's' .
Another advantage, when I prefix u for unsigned and i for signed ints, I immediately see where I am mixing signed and unsigned in potentially dangerous ways.
I cannot remember the number of times when in a huge 75000 line codebase, bugs were caused (by me and others too) due to naming local variables the same as existing member variables of that class. Since then, I always prefix members with 'm_'
Its a question of taste and experience. Don't knock it until you've tried it.
You're forgetting the number one reason to include this information. It has nothing to do with you, the programmer. It has everything to do with the person coming down the road 2 or 3 years after you leave the company who has to read that stuff.
Yes, an IDE will quickly identify types for you. However, when you're reading through some long batches of 'business rules' code, it's nice to not have to pause on each variable to find out what type it is. When I see things like strUserID, intProduct or guiProductID, it makes for much easier 'ramp up' time.
I agree that MS went way too far with some of their naming conventions - I categorize that in the "too much of a good thing" pile.
Naming conventions are good things, provided you stick to them. I've gone through enough old code that had me constantly going back to look at the definitions for so many similarly-named variables that I push "camel casing" (as it was called at a previous job). Right now I'm on a job that has many thousand of lines of completely uncommented classic ASP code with VBScript and it's a nightmare trying to figure things out.
Tacking on cryptic characters at the beginning of each variable name is unnecessary and shows that the variable name by itself isn't descriptive enough. Most languages require the variable type at declaration anyway, so that information is already available.
There's also the situation where, during maintenance, a variable type needs to change. Example: if a variable declared as "uint_16 u16foo" needs to become a 64-bit unsigned, one of two things will happen:
You'll go through and change each variable name (making sure not to hose any unrelated variables with the same name), or
Just change the type and not change the name, which will only cause confusion.
Joel Spolsky wrote a good blog post about this.
http://www.joelonsoftware.com/articles/Wrong.html
Basically it comes down to not making your code harder to read when a decent IDE will tell you want type the variable is if you can't remember. Also, if you make your code compartmentalized enough, you don't have to remember what a variable was declared as three pages up.
Isn't scope more important than type these days, e.g.
* l for local
* a for argument
* m for member
* g for global
* etc
With modern techniques of refactoring old code, search and replace of a symbol because you changed its type is tedious, the compiler will catch type changes, but often will not catch incorrect use of scope, sensible naming conventions help here.
There is no reason why you should not make correct use of Hungarian notation. It's unpopularity is due to a long-running back-lash against the mis-use of Hungarian notation, especially in the Windows APIs.
In the bad-old days, before anything resembling an IDE existed for DOS (odds are you didn't have enough free memory to run the compiler under Windows, so your development was done in DOS), you didn't get any help from hovering your mouse over a variable name. (Assuming you had a mouse.) What did you did have to deal with were event callback functions in which everything was passed to you as either a 16-bit int (WORD) or 32-bit int (LONG WORD). You then had to cast those parameter to the appropriate types for the given event type. In effect, much of the API was virtually type-less.
The result, an API with parameter names like these:
LRESULT CALLBACK WindowProc(HWND hwnd,
UINT uMsg,
WPARAM wParam,
LPARAM lParam);
Note that the names wParam and lParam, although pretty awful, aren't really any worse than naming them param1 and param2.
To make matters worse, Window 3.0/3.1 had two types of pointers, near and far. So, for example, the return value from memory management function LocalLock was a PVOID, but the return value from GlobalLock was an LPVOID (with the 'L' for long). That awful notation then got extended so that a long pointer string was prefixed lp, to distinguish it from a string that had simply been malloc'd.
It's no surprise that there was a backlash against this sort of thing.
Hungarian Notation can be useful in languages without compile-time type checking, as it would allow developer to quickly remind herself of how the particular variable is used. It does nothing for performance or behavior. It is supposed to improve code readability and is mostly a matter a taste and coding style. For this very reason it is criticized by many developers -- not everybody has the same wiring in the brain.
For the compile-time type-checking languages it is mostly useless -- scrolling up a few lines should reveal the declaration and thus type. If you global variables or your code block spans for much more than one screen, you have grave design and reusability issues. Thus one of the criticisms is that Hungarian Notation allows developers to have bad design and easily get away with it. This is probably one of the reasons for hatered.
On the other hand, there can be cases where even compile-time type-checking languages would benefit from Hungarian Notation -- void pointers or HANDLE's in win32 API. These obfuscates the actual data type, and there might be a merit to use Hungarian Notation there. Yet, if one can know the type of data at build time, why not to use the appropriate data type.
In general, there are no hard reasons not to use Hungarian Notation. It is a matter of likes, policies, and coding style.
As a Python programmer, Hungarian Notation falls apart pretty fast. In Python, I don't care if something is a string - I care if it can act like a string (i.e. if it has a ___str___() method which returns a string).
For example, let's say we have foo as an integer, 12
foo = 12
Hungarian notation tells us that we should call that iFoo or something, to denote it's an integer, so that later on, we know what it is. Except in Python, that doesn't work, or rather, it doesn't make sense. In Python, I decide what type I want when I use it. Do I want a string? well if I do something like this:
print "The current value of foo is %s" % foo
Note the %s - string. Foo isn't a string, but the % operator will call foo.___str___() and use the result (assuming it exists). foo is still an integer, but we treat it as a string if we want a string. If we want a float, then we treat it as a float. In dynamically typed languages like Python, Hungarian Notation is pointless, because it doesn't matter what type something is until you use it, and if you need a specific type, then just make sure to cast it to that type (e.g. float(foo)) when you use it.
Note that dynamic languages like PHP don't have this benefit - PHP tries to do 'the right thing' in the background based on an obscure set of rules that almost no one has memorized, which often results in catastrophic messes unexpectedly. In this case, some sort of naming mechanism, like $files_count or $file_name, can be handy.
In my view, Hungarian Notation is like leeches. Maybe in the past they were useful, or at least they seemed useful, but nowadays it's just a lot of extra typing for not a lot of benefit.
The IDE should impart that useful information. Hungarian might have made some sort (not a whole lot, but some sort) of sense when IDE's were much less advanced.
Apps Hungarian is Greek to me--in a good way
As an engineer, not a programmer, I immediately took to Joel's article on the merits of Apps Hungarian: "Making Wrong Code Look Wrong". I like Apps Hungarian because it mimics how engineering, science, and mathematics represent equations and formulas using sub- and super-scripted symbols (like Greek letters, mathematical operators, etc.). Take a particular example of Newton's Law of Universal Gravity: first in standard mathematical notation, and then in Apps Hungarian pseudo-code:
frcGravityEarthMars = G * massEarth * massMars / norm(posEarth - posMars)
In the mathematical notation, the most prominent symbols are those representing the kind of information stored in the variable: force, mass, position vector, etc. The subscripts play second fiddle to clarify: position of what? This is exactly what Apps Hungarian is doing; it's telling you the kind of thing stored in the variable first and then getting into specifics--about the closest code can get to mathematical notation.
Clearly strong typing can resolve the safe vs. unsafe string example from Joel's essay, but you wouldn't define separate types for position and velocity vectors; both are double arrays of size three and anything you're likely to do to one might apply to the other. Furthermore, it make perfect sense to concatenate position and velocity (to make a state vector) or take their dot product, but probably not to add them. How would typing allow the first two and prohibit the second, and how would such a system extend to every possible operation you might want to protect? Unless you were willing to encode all of math and physics in your typing system.
On top of all that, lots of engineering is done in weakly typed high-level languages like Matlab, or old ones like Fortran 77 or Ada.
So if you have a fancy language and IDE and Apps Hungarian doesn't help you then forget it--lots of folks apparently have. But for me, a worse than a novice programmer who is working in weakly or dynamically typed languages, I can write better code faster with Apps Hungarian than without.
It's incredibly redundant and useless is most modern IDEs, where they do a good job of making the type apparent.
Plus -- to me -- it's just annoying to see intI, strUserName, etc. :)
If I feel that useful information is being imparted, why shouldn't I put it right there where it's available?
Then who cares what anybody else thinks? If you find it useful, then use the notation.
Im my experience, it is bad because:
1 - then you break all the code if you need to change the type of a variable (i.e. if you need to extend a 32 bits integer to a 64 bits integer);
2 - this is useless information as the type is either already in the declaration or you use a dynamic language where the actual type should not be so important in the first place.
Moreover, with a language accepting generic programming (i.e. functions where the type of some variables is not determine when you write the function) or with dynamic typing system (i.e. when the type is not even determine at compile time), how would you name your variables? And most modern languages support one or the other, even if in a restricted form.
In Joel Spolsky's Making Wrong Code Look Wrong he explains that what everybody thinks of as Hungarian Notation (which he calls Systems Hungarian) is not what was it was really intended to be (what he calls Apps Hungarian). Scroll down to the I’m Hungary heading to see this discussion.
Basically, Systems Hungarian is worthless. It just tells you the same thing your compiler and/or IDE will tell you.
Apps Hungarian tells you what the variable is supposed to mean, and can actually be useful.
I've always thought that a prefix or two in the right place wouldn't hurt. I think if I can impart something useful, like "Hey this is an interface, don't count on specific behaviour" right there, as in IEnumerable, I oughtta do it. Comment can clutter things up much more than just a one or two character symbol.
It's a useful convention for naming controls on a form (btnOK, txtLastName etc.), if the list of controls shows up in an alphabetized pull-down list in your IDE.
I tend to use Hungarian Notation with ASP.NET server controls only, otherwise I find it too hard to work out what controls are what on the form.
Take this code snippet:
<asp:Label ID="lblFirstName" runat="server" Text="First Name" />
<asp:TextBox ID="txtFirstName" runat="server" />
<asp:RequiredFieldValidator ID="rfvFirstName" runat="server" ... />
If someone can show a better way of having that set of control names without Hungarian I'd be tempted to move to it.
Joel's article is great, but it seems to omit one major point:
Hungarian makes a particular 'idea' (kind + identifier name) unique,
or near-unique, across the codebase - even a very large codebase.
That's huge for code maintenance.
It means you can use good ol' single-line text search
(grep, findstr, 'find in all files') to find EVERY mention of that 'idea'.
Why is that important when we have IDE's that know how to read code?
Because they're not very good at it yet. This is hard to see in a small codebase,
but obvious in a large one - when the 'idea' might be mentioned in comments,
XML files, Perl scripts, and also in places outside source control (documents, wikis,
bug databases).
You do have to be a little careful even here - e.g. token-pasting in C/C++ macros
can hide mentions of the identifier. Such cases can be dealt with using
coding conventions, and anyway they tend to affect only a minority of the identifiers in the
codebase.
P.S. To the point about using the type system vs. Hungarian - it's best to use both.
You only need wrong code to look wrong if the compiler won't catch it for you. There are plenty of cases where it is infeasible to make the compiler catch it. But where it's feasible - yes, please do that instead!
When considering feasibility, though, do consider the negative effects of splitting up types. e.g. in C#, wrapping 'int' with a non-built-in type has huge consequences. So it makes sense in some situations, but not in all of them.
Debunking the benefits of Hungarian Notation
It provides a way of distinguishing variables.
If the type is all that distinguishes the one value from another, then it can only be for the conversion of one type to another. If you have the same value that is being converted between types, chances are you should be doing this in a function dedicated to conversion. (I have seen hungarianed VB6 leftovers use strings on all of their method parameters simply because they could not figure out how to deserialize a JSON object, or properly comprehend how to declare or use nullable types.) If you have two variables distinguished only by the Hungarian prefix, and they are not a conversion from one to the other, then you need to elaborate on your intention with them.
It makes the code more readable.
I have found that Hungarian notation makes people lazy with their variable names. They have something to distinguish it by, and they feel no need to elaborate to its purpose. This is what you will typically find in Hungarian notated code vs. modern: sSQL vs. groupSelectSql (or usually no sSQL at all because they are supposed to be using the ORM that was put in by earlier developers.), sValue vs. formCollectionValue (or usually no sValue either, because they happen to be in MVC and should be using its model binding features), sType vs. publishSource, etc.
It can't be readability. I see more sTemp1, sTemp2... sTempN from any given hungarianed VB6 leftover than everybody else combined.
It prevents errors.
This would be by virtue of number 2, which is false.
In the words of the master:
http://www.joelonsoftware.com/articles/Wrong.html
An interesting reading, as usual.
Extracts:
"Somebody, somewhere, read Simonyi’s paper, where he used the word “type,” and thought he meant type, like class, like in a type system, like the type checking that the compiler does. He did not. He explained very carefully exactly what he meant by the word “type,” but it didn’t help. The damage was done."
"But there’s still a tremendous amount of value to Apps Hungarian, in that it increases collocation in code, which makes the code easier to read, write, debug, and maintain, and, most importantly, it makes wrong code look wrong."
Make sure you have some time before reading Joel On Software. :)
Several reasons:
Any modern IDE will give you the variable type by simply hovering your mouse over the variable.
Most type names are way long (think HttpClientRequestProvider) to be reasonably used as prefix.
The type information does not carry the right information, it is just paraphrasing the variable declaration, instead of outlining the purpose of the variable (think myInteger vs. pageSize).
I don't think everyone is rabidly against it. In languages without static types, it's pretty useful. I definitely prefer it when it's used to give information that is not already in the type. Like in C, char * szName says that the variable will refer to a null terminated string -- that's not implicit in char* -- of course, a typedef would also help.
Joel had a great article on using hungarian to tell if a variable was HTML encoded or not:
http://www.joelonsoftware.com/articles/Wrong.html
Anyway, I tend to dislike Hungarian when it's used to impart information I already know.
Of course when 99% of programmers agree on something, there is something wrong. The reason they agree here is because most of them have never used Hungarian notation correctly.
For a detailed argument, I refer you to a blog post I have made on the subject.
http://codingthriller.blogspot.com/2007/11/rediscovering-hungarian-notation.html
I started coding pretty much the about the time Hungarian notation was invented and the first time I was forced to use it on a project I hated it.
After a while I realised that when it was done properly it did actually help and these days I love it.
But like all things good, it has to be learnt and understood and to do it properly takes time.
The Hungarian notation was abused, particularly by Microsoft, leading to prefixes longer than the variable name, and showing it is quite rigid, particularly when you change the types (the infamous lparam/wparam, of different type/size in Win16, identical in Win32).
Thus, both due to this abuse, and its use by M$, it was put down as useless.
At my work, we code in Java, but the founder cames from MFC world, so use similar code style (aligned braces, I like this!, capitals to method names, I am used to that, prefix like m_ to class members (fields), s_ to static members, etc.).
And they said all variables should have a prefix showing its type (eg. a BufferedReader is named brData). Which shown as being a bad idea, as the types can change but the names doesn't follow, or coders are not consistent in the use of these prefixes (I even see aBuffer, theProxy, etc.!).
Personally, I chose for a few prefixes that I find useful, the most important being b to prefix boolean variables, as they are the only ones where I allow syntax like if (bVar) (no use of autocast of some values to true or false).
When I coded in C, I used a prefix for variables allocated with malloc, as a reminder it should be freed later. Etc.
So, basically, I don't reject this notation as a whole, but took what seems fitting for my needs.
And of course, when contributing to some project (work, open source), I just use the conventions in place!

How can I program a simple chat bot AI?

I want to build a bot that asks someone a few simple questions and branches based on the answer. I realize parsing meaning from the human responses will be challenging, but how do you setup the program to deal with the "state" of the conversation?
It will be a one-to-one conversation between a human and the bot.
You probably want to look into Markov Chains as the basics for the bot AI. I wrote something a long time ago (the code to which I'm not proud of at all, and needs some mods to run on Python > 1.5) that may be a useful starting place for you: http://sourceforge.net/projects/benzo/
EDIT: Here's a minimal example in Python of a Markov Chain that accepts input from stdin and outputs text based on the probabilities of words succeeding one another in the input. It's optimized for IRC-style chat logs, but running any decent-sized text through it should demonstrate the concepts:
import random, sys
NONWORD = "\n"
STARTKEY = NONWORD, NONWORD
MAXGEN=1000
class MarkovChainer(object):
def __init__(self):
self.state = dict()
def input(self, input):
word1, word2 = STARTKEY
for word3 in input.split():
self.state.setdefault((word1, word2), list()).append(word3)
word1, word2 = word2, word3
self.state.setdefault((word1, word2), list()).append(NONWORD)
def output(self):
output = list()
word1, word2 = STARTKEY
for i in range(MAXGEN):
word3 = random.choice(self.state[(word1,word2)])
if word3 == NONWORD: break
output.append(word3)
word1, word2 = word2, word3
return " ".join(output)
if __name__ == "__main__":
c = MarkovChainer()
c.input(sys.stdin.read())
print c.output()
It's pretty easy from here to plug in persistence and an IRC library and have the basis of the type of bot you're talking about.
Folks have mentioned already that statefulness isn't a big component of typical chatbots:
a pure Markov implementations may express a very loose sort of state if it is growing its lexicon and table in real time—earlier utterances by the human interlocutor may get regurgitated by chance later in the conversation—but the Markov model doesn't have any inherent mechanism for selecting or producing such responses.
a parsing-based bot (e.g. ELIZA) generally attempts to respond to (some of the) semantic content of the most recent input from the user without significant regard for prior exchanges.
That said, you certainly can add some amount of state to a chatbot, regardless of the input-parsing and statement-synthesis model you're using. How to do that depends a lot on what you want to accomplish with your statefulness, and that's not really clear from your question. A couple general ideas, however:
Create a keyword stack. As your human offers input, parse out keywords from their statements/questions and throw those keywords onto a stack of some sort. When your chatbot fails to come up with something compelling to respond to in the most recent input—or, perhaps, just at random, to mix things up—go back to your stack, grab a previous keyword, and use that to seed your next synthesis. For bonus points, have the bot explicitly acknowledge that it's going back to a previous subject, e.g. "Wait, HUMAN, earlier you mentioned foo. [Sentence seeded by foo]".
Build RPG-like dialogue logic into the bot. As your parsing human input, toggle flags for specific conversational prompts or content from the user and conditionally alter what the chatbot can talk about, or how it communicates. For example, a chatbot bristling (or scolding, or laughing) at foul language is fairly common; a chatbot that will get het up, and conditionally remain so until apologized to, would be an interesting stateful variation on this. Switch output to ALL CAPS, throw in confrontational rhetoric or demands or sobbing, etc.
Can you clarify a little what you want the state to help you accomplish?
Imagine a neural network with parsing capabilities in each node or neuron. Depending on rules and parsing results, neurons fire. If certain neurons fire, you get a good idea about topic and semantic of the question and therefore can give a good answer.
Memory is done by keeping topics talked about in a session, adding to the firing for the next question, and therefore guiding the selection process of possible answers at the end.
Keep your rules and patterns in a knowledge base, but compile them into memory at start time, with a neuron per rule. You can engineer synapses using something like listeners or event functions.
I think you can look at the code for Kooky, and IIRC it also uses Markov Chains.
Also check out the kooky quotes, they were featured on Coding Horror not long ago and some are hilarious.
I think to start this project, it would be good to have a database with questions (organized as a tree. In every node one or more questions).
These questions sould be answered with "yes " or "no".
If the bot starts to question, it can start with any question from yuor database of questions marked as a start-question. The answer is the way to the next node in the tree.
Edit: Here is a somple one written in ruby you can start with: rubyBOT
naive chatbot program. No parsing, no cleverness, just a training file and output.
It first trains itself on a text and then later uses the data from that training to generate responses to the interlocutor’s input. The training process creates a dictionary where each key is a word and the value is a list of all the words that follow that word sequentially anywhere in the training text. If a word features more than once in this list then that reflects and it is more likely to be chosen by the bot, no need for probabilistic stuff just do it with a list.
The bot chooses a random word from your input and generates a response by choosing another random word that has been seen to be a successor to its held word. It then repeats the process by finding a successor to that word in turn and carrying on iteratively until it thinks it’s said enough. It reaches that conclusion by stopping at a word that was prior to a punctuation mark in the training text. It then returns to input mode again to let you respond, and so on.
It isn’t very realistic but I hereby challenge anyone to do better in 71 lines of code !! This is a great challenge for any budding Pythonists, and I just wish I could open the challenge to a wider audience than the small number of visitors I get to this blog. To code a bot that is always guaranteed to be grammatical must surely be closer to several hundred lines, I simplified hugely by just trying to think of the simplest rule to give the computer a mere stab at having something to say.
Its responses are rather impressionistic to say the least ! Also you have to put what you say in single quotes.
I used War and Peace for my “corpus” which took a couple of hours for the training run, use a shorter file if you are impatient…
here is the trainer
#lukebot-trainer.py
import pickle
b=open('war&peace.txt')
text=[]
for line in b:
for word in line.split():
text.append (word)
b.close()
textset=list(set(text))
follow={}
for l in range(len(textset)):
working=[]
check=textset[l]
for w in range(len(text)-1):
if check==text[w] and text[w][-1] not in '(),.?!':
working.append(str(text[w+1]))
follow[check]=working
a=open('lexicon-luke','wb')
pickle.dump(follow,a,2)
a.close()
here is the bot
#lukebot.py
import pickle,random
a=open('lexicon-luke','rb')
successorlist=pickle.load(a)
a.close()
def nextword(a):
if a in successorlist:
return random.choice(successorlist[a])
else:
return 'the'
speech=''
while speech!='quit':
speech=raw_input('>')
s=random.choice(speech.split())
response=''
while True:
neword=nextword(s)
response+=' '+neword
s=neword
if neword[-1] in ',?!.':
break
print response
You tend to get an uncanny feeling when it says something that seems partially to make sense.
I would suggest looking at Bayesian probabilities. Then just monitor the chat room for a period of time to create your probability tree.
I'm not sure this is what you're looking for, but there's an old program called ELIZA which could hold a conversation by taking what you said and spitting it back at you after performing some simple textual transformations.
If I remember correctly, many people were convinced that they were "talking" to a real person and had long elaborate conversations with it.
If you're just dabbling, I believe Pidgin allows you to script chat style behavior. Part of the framework probably tacks the state of who sent the message when, and you'd want to keep a log of your bot's internal state for each of the last N messages. Future state decisions could be hardcoded based on inspection of previous states and the content of the most recent few messages. Or you could do something like the Markov chains discussed and use it both for parsing and generating.
If you do not require a learning bot, using AIML (http://www.aiml.net/) will most likely produce the result you want, at least with respect to the bot parsing input and answering based on it.
You would reuse or create "brains" made of XML (in the AIML-format) and parse/run them in a program (parser). There are parsers made in several different languages to choose from, and as far as I can tell the code seems to be open source in most cases.
You can use "ChatterBot", and host it locally using - 'flask-chatterbot-master"
Links:
[ChatterBot Installation]
https://chatterbot.readthedocs.io/en/stable/setup.html
[Host Locally using - flask-chatterbot-master]: https://github.com/chamkank/flask-chatterbot
Cheers,
Ratnakar