How to create quasi-copy of a file [closed] - language-agnostic

I would like to create a quasi-copy of a directory of mine that contains sensitive data.
I would then like to share this quasi-copy with others as a stand-in for the so-called 'real data'.
This 'real data' would allow others to run tests related to storage performance.
My question is how to create a copy of any file (text, jpeg, sqlite.db, ...) that contains none of its original data, but from the point of view of compression, de-duplication and so on is very similar.
I would appreciate any pointers to tools or libraries that help with creating such a quasi-copy.
I would also appreciate any pointers on what to measure, and how, to judge the similarity of the original file and its quasi-copy.

I don't know whether a "quasi-copy" is an established notion with accepted rules and procedures, but here is a crude take on how to "mask" data for protection: replace words with equal-length sequences of (perhaps adjusted) random characters. One cannot then do a very accurate storage analysis of the real data, but some accuracy has to be sacrificed with any data scrambling.
One way to build such a "quasi-word," wrapped in a program for convenience:
use warnings;
use strict;
use feature 'say';
use Scalar::Util qw(looks_like_number);

my $word = shift // die "Usage: $0 word\n";
my @alphabet = ('a'..'z');

my $quasi_word;
foreach my $c (split '', $word) {
    if (looks_like_number($c)) {
        $quasi_word .= int rand 10;                # digits map to random digits
    }
    else {
        $quasi_word .= $alphabet[int rand 26];     # letters map to random letters
    }
}
say $quasi_word;
This doesn't cut it at all for a de-duplication analysis. For that one can replace repeated words by the same random sequence, for example as follows.
First make a pass over the words of the file and build a frequency hash of how many times each word appears. Then, as each word is processed, check whether it is one of the repeated words; if it is, build a random replacement only the first time and reuse that replacement every time thereafter.
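A minimal sketch of that consistent-replacement idea, in Python for brevity (memoizing each word's replacement gives the same effect as the frequency-hash pass):

import random
import re
import string

replacements = {}  # word -> fixed replacement, so repeated words stay duplicates

def quasi_word(word):
    if word not in replacements:
        replacements[word] = "".join(
            random.choice(string.digits) if ch.isdigit()
            else random.choice(string.ascii_lowercase)
            for ch in word)
    return replacements[word]

def mask_text(text):
    # Replace each alphanumeric run; punctuation and whitespace stay put.
    return re.sub(r"\w+", lambda m: quasi_word(m.group(0)), text)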
Further adjustments for specific needs should be easy to add.
Of course, any full masking (scrambling, tokenization, ...) of the data rules out a precise analysis of how the real data would compress, since the analysis runs on a mangled set.
If you know specific sensitive parts then only those can be masked and that would improve the accuracy of the following analyses considerably.
This will be slow, but if the set of files only needs to be built once in a while that shouldn't matter.
A question was raised about the "criteria" for the "similarity" of masked data, and the question itself was closed for lack of detail. Here is a comment on that.
It seems that the only "measure" of "similarity" is simply whether the copy behaves the same in the storage-performance analysis as the real data would. But one can't tell without running that analysis on the real data! (Which would, clearly, reveal that data.)
The one way I can think of is to build a copy using a candidate approach and then use it yourself for individual components of that analysis. Does it compress (roughly) the same? How about de-duplication? How about...? Etc. Then make your choices.
If the approach used is flexible enough, the masking can then be adjusted for whichever part of the analysis "failed" -- where the copy behaved substantially differently. (If compression came out very different, perhaps refine the algorithm to study words and produce a more similar obfuscation, etc.)
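For the measure-it-yourself step, here is a small sketch of the kind of checks one can run over both the original and the quasi-copy (Python; the file names are hypothetical):

import hashlib
import zlib

def compression_ratio(data):
    # Compressed size over original size; compare real vs. quasi-copy values.
    return len(zlib.compress(data, 9)) / len(data)

def dedup_ratio(data, chunk_size=4096):
    # Fraction of unique fixed-size chunks, a crude stand-in for dedup behaviour.
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    unique = {hashlib.sha256(c).digest() for c in chunks}
    return len(unique) / len(chunks)

for name in ("real.db", "quasi.db"):  # hypothetical file names
    data = open(name, "rb").read()
    print(name, compression_ratio(data), dedup_ratio(data))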


Adding user defined functions to a simple calculator YACC [closed]

I've been searching all over the internet for a comprehensible example of how to define and call a function in a simple calculator interpreter. Maybe I've found the answer, but since I'm not familiar with YACC I couldn't see it.
So the question is, how do you set up a symbol table for user defined functions and how do you store/call these functions in a calculator interpreter?
I'm basically looking to achieve something like this:
def sum(a,b) { a + b }
sum(5,5)
result:
10
Any pointers or examples would be appreciated.
That's definitely diving into the concepts required to interpret (or compile) a programming language, which makes it difficult to provide an answer in a format suitable for StackOverflow. Here's a quick outline:
You need a symbol table which can hold both functions and variables. Or two symbol tables. In the first case, the mapped value will be some kind of variant type, such as a discriminated union; you might need that anyway if you have more than one type of variable. In the second case, you can use a specific type for the mapped value of function names. I'd go for the first option, because it allows functions to be first-class objects.
You need some kind of type which represents the "value" of a function definition. The obvious type is the Abstract Syntax Tree (AST) of an expression (or a program), and doing that will generally simplify your code so I'd highly recommend it. That means that the calculator/parser will not actually evaluate 5+5 (even if that is the literal input) or a+b, but rather will return an AST to whoever called the parser. Consequently, you will need:
A function which can evaluate an AST. That's usually straightforward to write, since it's just a depth-first tree walk. But now you need to worry about variable scope, because when you do evaluate the body of your function sum, you probably want to set the values of the parameters only locally.
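To make the shape of that tree walk concrete, here is a minimal sketch (Python rather than YACC output, with an invented tuple-based AST; real grammar actions would build equivalent nodes):

def evaluate(node, env):
    kind = node[0]
    if kind == "num":                      # ("num", 10)
        return node[1]
    if kind == "var":                      # ("var", "a")
        return env[node[1]]
    if kind == "add":                      # ("add", lhs, rhs)
        return evaluate(node[1], env) + evaluate(node[2], env)
    if kind == "def":                      # ("def", name, params, body)
        env[node[1]] = ("func", node[2], node[3])   # functions live in the same table
        return None
    if kind == "call":                     # ("call", name, args)
        _, params, body = env[node[1]]
        local = dict(env)                  # parameters shadow only locally
        local.update(zip(params, (evaluate(a, env) for a in node[2])))
        return evaluate(body, local)

env = {}
evaluate(("def", "sum", ["a", "b"], ("add", ("var", "a"), ("var", "b"))), env)
print(evaluate(("call", "sum", [("num", 5), ("num", 5)]), env))  # prints 10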
If you manage all that, you will have gone several steps beyond the usual "let's build a calculator with flex and bison" project, and I definitely encourage you to do so. You may want to take a look at the classic text Structure and Interpretation of Computer Programs (Abelson & Sussman, 1996; often referred to simply as SICP).

How should I test HTML generated for a web page? [closed]

I have some HTML on a page that has a bunch of tables and data (it's a report page). This is all legacy code, so no harassment necessary on the use of tables.
Given that it is legacy code, it is fragile, and we want to confirm that the table looks like we want (number of columns, rows, and the data inside of them are accurate).
My first inclination is to use Selenium WebDriver and run through everything that way (Page Object pattern), but a co-worker suggested that I just view the source of the page, copy the table in question, and then use that for a string comparison in the test.
My initial thought on his proposal is that it is not a good test, because you're starting with the answer and then writing a test to make sure you get that answer (essentially non-TDD). But I'm not sure that's a good enough objection in this case.
How should I test the HTML table to make sure all columns and rows are as we want, in addition to the contents of each cell?
It depends. String matching sounds like approval testing; depending on just how dynamic the table is, that could be fine.
If I already had Selenium tests running, I'd stick with what I have, using findElements to count and verify the various columns, rows, and values.
Re: your comment, if you cannot convince the developers to add ids, names, or something else to make your job easier, and you do go the Selenium route, then XPath is probably what you will want to use. We've created utility methods to help in these sorts of situations:
public boolean isLabeledTextPresent(String label, String text) {
    WebElement element = findElement(By.xpath("//tr/th/label[contains(text(), '" +
        label + "')]/ancestor::tr/td"));
    String labeledText = element.getText().trim();
    return labeledText.contains(text);
}
I think both methods are valid; it really depends on what you are trying to do and which advantages/disadvantages work best for you.
It would take a little work (depending on your or your teammates' skill sets and experience) to write a Selenium script to scrape the table and verify certain things.
Advantages:
Once completed, it will validate very quickly and will be less fragile than method #2.
Disadvantages:
This is dependent on how quickly you can write a script and how easily you are able to validate all the things you want to validate. If all you want is # cols/rows and cell content, that should be very easy. If you want to validate things like formatting (size, color, etc.) then that starts to get a little more complicated to do through code.
It would be super easy to copy/paste the HTML and validate against that. The problem is, as you pointed out, that you are starting with the answer in some respects. You can get around that by manually validating that the HTML source for the table is correct. Once you have that, you can open the page and compare the source of the table against what you have validated.
Advantages:
You will be able to tell when anything changes... formatting, data, # cells, ... everything.
Disadvantages:
You will be able to tell when anything changes... lol. Your test will fail when anything is changed which will make the test very fragile if you expect that the table will ever get updated. If it gets updated, you will have to revalidate all the HTML for the table which could get to be a tedious process depending how often you expect this to happen. One thing that will help with this validation is to use a diffing tool... you can quickly determine what has changed and validate that instead of having to validate everything each time there is a change.
I would lean towards #1, write the script. It will be less fragile and as long as someone has the right skills shouldn't be that big of a task.
EDIT
You didn't specify what language you are working in but here's some code in Java that hopefully will point you in the right direction if you choose to write a script.
WebElement table = driver.findElement(...);  // locator elided in the original
List<WebElement> rows = table.findElements(By.tagName("tr"));
// using JUnit's assertEquals(message, expected, actual)
assertEquals("Validate expected number of rows", expectedNoOfRows, rows.size());
for (int row = 0; row < rows.size(); row++)
{
    List<WebElement> cells = rows.get(row).findElements(By.tagName("td"));
    assertEquals("Validate expected number of cells in row " + row, expectedNoOfCells[row], cells.size());
    for (int cell = 0; cell < cells.size(); cell++)
    {
        assertEquals("Validate expected text in (" + cell + "," + row + ")",
                expectedText[cell][row], cells.get(cell).getText().trim());
    }
}
You could do something like this at a basic level. If you want to get more fancy, you could add logic that looks for specific parts of the report so you can get a "landmark", e.g. Summary, Data, ... making headings up ... so you will know what to expect in the next section.
You could run a variation of this code to dump the different values, number of rows, number of cells in each row, and cell contents. Once you validate that those values are correct, you could use that as your master and do comparisons vs it. That will keep you from false fails on comparing straight HTML source. Maybe it's something in the middle between a script and text comparison based on HTML source.
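As a sketch of that dump-once-then-compare idea (Python Selenium here; the locator and file name are made up, and driver is assumed to be an already-open WebDriver):

import json
from selenium.webdriver.common.by import By

table = driver.find_element(By.CSS_SELECTOR, "table#report")
rows = [
    [cell.text.strip() for cell in row.find_elements(By.TAG_NAME, "td")]
    for row in table.find_elements(By.TAG_NAME, "tr")
]
with open("report_master.json", "w") as f:
    json.dump(rows, f, indent=2)
# Later runs: load report_master.json and diff it against a fresh dump.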

When could CSV records *not* have the same number of fields?

I am storing a series of events to a CSV file, each event type comes with a different set of data.
To illustrate, say I have two events (there will be many more):
Running, which has a data set containing speed and incline.
Sleeping, which has a data set containing snores.
There are two options to store this data in CSV records:
Option A
Storing each possible item of data in its own field...
speed, incline, snores
therefore...
15mph, 20%,
, , 12
16mph, 20%,
14mph, 20%,
Option B
Storing each event in its own record...
event, value1...
therefore...
running, 15mph, 20%
sleeping, 12
running, 16mph, 20%
running, 14mph, 20%
Without a specific CSV specification, the consensus seems to be:
Each record "should" contain the same number of comma-separated fields.
Context
There are a number of events which each have a large & different set of data values.
CSV data is to be of use to other developers (I will/could/should/won't use either structure).
The 'other developers' are toward the novice end of the spectrum and/or using resource-limited systems. CSV is accessible.
The CSV format is being provided non-exclusively, as a feature rather than a requirement. That said, if the application is going to provide a CSV file, it should be provided in the correct manner from now on.
Question
Would it be valid, in this case, to go with Option B?
Thoughts
Option B maintains a level of human readability, which is an advantage if, say, the CSV is read by a human rather than a machine. Neither method is more complex to parse using a custom parser, but will Option B void the usefulness of the CSV format with other libraries, frameworks, applications, et al.? With Option A, future changes/versions to the data set of an individual event may break the CSV structure (zombie ", ," fields to maintain forward compatibility), whereas Option B will fail gracefully.
edit
This may be aimed at students and frameworks like OpenFrameworks, Plask, Processing, et al., where CSV is easier to implement.
Any "other frameworks, libraries and applications" I've ever used all handle CSV parsing differently, so trying to conform to one or many of these standards might over-complicate your end result. My recommendation would be to keep it simple and use what works for your specific task. If human readbility is a requirement, then CSV in the form of Option B would work fine. Otherwise, you may want to consider JSON or XML.
As you say, there is no "CSV standard" with regard to contents. The real answer depends on what you are doing and why. You mention "other frameworks, libraries and applications". The one thing I've learnt is: don't over-engineer. That is, don't write reams of code today on the assumption that you will plug it into some other framework tomorrow.
I'd say option B is fine, unless you have specific requirements to use other apps etc.
< edit >
Having re-read your context, I'd probably pick one output format and use it, and forget about having multiple formats:
Having multiple output formats is a source of inconsistency (e.g. bug in one format but not another).
Having multiple formats means more code that needs to be
tested
documented
supported
< /edit >
Is there any reason you can't use XML? Yes, it's slightly more difficult to parse, at least for novices, but if so they probably need the practice. File size would be much greater, of course, but it's compressible.

Is it bad to perform two different tasks in the same loop? [closed]

I'm working on a highly-specialized search engine for my database. When the user submits a search request, the engine splits the search terms into an array and loops through. Inside the loop, each search term is examined against several possible scenarios to determine what it could mean. When a search term matches a scenario, a WHERE condition is added to the SQL query. Some terms can have multiple meanings, and in those cases the engine builds a list of suggestions to help the user to narrow the results.
Aside: In case anyone is interested to know, ambiguous terms are refined by prefixing them with a keyword. For example, 1954 could be a year or a serial number. The engine will suggest both of these scenarios to the user and modify the search term to either year:1954 or serial:1954.
Building the SQL query and the refine suggestions in the same loop feels somehow wrong to me, but to separate them would add more overhead because I would have to loop through the same array twice and test all the same scenarios twice. What is the better course of action?
I'd probably factor out the two actions into their own functions. Then you'd have
foreach (term in terms) {
    doThing1();
    doThing2();
}
which is nice and clean.
No. It's not bad. I would think looping twice would be more confusing.
Arguably some of the tasks might be put into functions if the tasks are decoupled enough from each other, however.
I don't think it makes sense to add multiple loops for the sake of theoretical purity, especially given that if you're going to add a loop per scenario you're going from O(n) to O(n * #scenarios). Another way to break this out without falling into the "God method" trap would be to have one method that runs a single loop and returns an array of matches, and another that runs the search for each element in the match array.
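A sketch of that two-method split (Python; match_scenarios and sql_condition are hypothetical names):

def classify_terms(terms):
    # The single O(n * #scenarios) pass: decide what each term means.
    return [match_scenarios(term) for term in terms]

def build_where_clause(matches):
    # A second, cheap pass over the already-classified matches.
    return " AND ".join(m.sql_condition for m in matches)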
Using the same loop seems like a valid optimization to me; try to keep the code of the two tasks independent so this optimization can be changed if necessary.
Your scenario fits the builder pattern and if each operation is fairly complex then it would serve you well to break things up a bit. This is waaaaaay over engineering if all your logic fits in 50 lines of code, but if you have dependencies to manage and complex logic, then you should be using a proven design pattern to achieve separation of concerns. It might look like this:
var relatedTermsBuilder = new RelatedTermsBuilder();
var whereClauseBuilder = new WhereClauseBuilder();
var compositeBuilder = new CompositeBuilder()
    .Add(relatedTermsBuilder)
    .Add(whereClauseBuilder);

var parser = new SearchTermParser(compositeBuilder);
parser.Execute("the search phrase");

string[] related = relatedTermsBuilder.Result;
string whereClause = whereClauseBuilder.Result;
The supporting objects would look like:
public interface ISearchTermBuilder {
    void Build(string term);
}

public class SearchTermParser {
    private readonly ISearchTermBuilder builder;

    public SearchTermParser(ISearchTermBuilder builder) {
        this.builder = builder;
    }

    public void Execute(string phrase) {
        foreach (var term in Parse(phrase)) {
            builder.Build(term);
        }
    }

    private static IEnumerable<string> Parse(string phrase) {
        throw new NotImplementedException();
    }
}
I'd call it a code smell, but not a very bad one. I would separate out the functionality inside the loop, putting one of the things first, and then after a blank line and/or comment the other one.
I would look at it as if it were an instance of the observer pattern: each time you loop you raise an event, and as many observers as you want can subscribe to it. Of course it would be overkill to implement it as the full pattern, but the similarity tells me that it is just fine to execute two or three or however many actions you want.
I don't think it's wrong to make two actions in one loop. I'd even suggest to make two methods that are called from inside the loop, like:
for (...) {
    refineSuggestions(..);
    buildQuery();
}
On the other hand, O(n) = O(2n)
So don't worry too much - it isn't such a performance sin.
You could certainly run two loops.
If a lot of this is business logic, you could also create some kind of data structure in the first loop, and then use that to generate the SQL, something like
search_objects = []

loop through term in terms
    search_object = {}
    search_object.string = term
    // suggestion & rules code
    search_object.suggestion = suggestion
    search_object.rule = { 'contains', 'term' }
    search_objects.push(search_object)

loop through search_object in search_objects
    // generate SQL based on search_object.rule
This at least saves you from having to do if/then/elses in both loops, and I think it is a bit cleaner to move SQL code creation outside of the first loop.
If the things you're doing in the loop are related, then fine. It probably makes sense to code "the stuff for each iteration" and then wrap it in a loop, since that's probably how you think of it in your head.
Add a comment and if it gets too long, look at splitting it or using simple utility methods.
I think one could argue that this may not exactly be language-agnostic; it's also highly dependent on what you're trying to accomplish. If you're putting multiple tasks in a loop in such a way that they cannot be easily parallelized by the compiler for a parallel environment, then it is definitely a code smell.

How can I program a simple chat bot AI?

I want to build a bot that asks someone a few simple questions and branches based on the answer. I realize parsing meaning from the human responses will be challenging, but how do you set up the program to deal with the "state" of the conversation?
It will be a one-to-one conversation between a human and the bot.
You probably want to look into Markov chains as the basis for the bot AI. I wrote something a long time ago (code I'm not proud of at all, and which needs some mods to run on Python > 1.5) that may be a useful starting place for you: http://sourceforge.net/projects/benzo/
EDIT: Here's a minimal example in Python of a Markov Chain that accepts input from stdin and outputs text based on the probabilities of words succeeding one another in the input. It's optimized for IRC-style chat logs, but running any decent-sized text through it should demonstrate the concepts:
import random, sys

NONWORD = "\n"
STARTKEY = NONWORD, NONWORD
MAXGEN = 1000

class MarkovChainer(object):
    def __init__(self):
        self.state = dict()

    def input(self, input):
        word1, word2 = STARTKEY
        for word3 in input.split():
            self.state.setdefault((word1, word2), list()).append(word3)
            word1, word2 = word2, word3
        self.state.setdefault((word1, word2), list()).append(NONWORD)

    def output(self):
        output = list()
        word1, word2 = STARTKEY
        for i in range(MAXGEN):
            word3 = random.choice(self.state[(word1, word2)])
            if word3 == NONWORD:
                break
            output.append(word3)
            word1, word2 = word2, word3
        return " ".join(output)

if __name__ == "__main__":
    c = MarkovChainer()
    c.input(sys.stdin.read())
    print c.output()
It's pretty easy from here to plug in persistence and an IRC library and have the basis of the type of bot you're talking about.
Folks have mentioned already that statefulness isn't a big component of typical chatbots:
a pure Markov implementation may express a very loose sort of state if it is growing its lexicon and table in real time—earlier utterances by the human interlocutor may get regurgitated by chance later in the conversation—but the Markov model doesn't have any inherent mechanism for selecting or producing such responses.
a parsing-based bot (e.g. ELIZA) generally attempts to respond to (some of the) semantic content of the most recent input from the user without significant regard for prior exchanges.
That said, you certainly can add some amount of state to a chatbot, regardless of the input-parsing and statement-synthesis model you're using. How to do that depends a lot on what you want to accomplish with your statefulness, and that's not really clear from your question. A couple general ideas, however:
Create a keyword stack. As your human offers input, parse out keywords from their statements/questions and throw those keywords onto a stack of some sort. When your chatbot fails to come up with something compelling to respond to in the most recent input—or, perhaps, just at random, to mix things up—go back to your stack, grab a previous keyword, and use that to seed your next synthesis. For bonus points, have the bot explicitly acknowledge that it's going back to a previous subject, e.g. "Wait, HUMAN, earlier you mentioned foo. [Sentence seeded by foo]". (A sketch of this idea appears at the end of this answer.)
Build RPG-like dialogue logic into the bot. As you're parsing human input, toggle flags for specific conversational prompts or content from the user and conditionally alter what the chatbot can talk about, or how it communicates. For example, a chatbot bristling (or scolding, or laughing) at foul language is fairly common; a chatbot that will get het up, and conditionally remain so until apologized to, would be an interesting stateful variation on this. Switch output to ALL CAPS, throw in confrontational rhetoric or demands or sobbing, etc.
Can you clarify a little what you want the state to help you accomplish?
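A minimal sketch of the keyword-stack idea from the first bullet above (everything here is illustrative):

import random
import re

STOPWORDS = {"the", "a", "an", "and", "you", "i", "to", "of"}
keyword_stack = []

def remember_keywords(utterance):
    # Push interesting words so the bot can revisit them later.
    for word in re.findall(r"[a-z']+", utterance.lower()):
        if word not in STOPWORDS:
            keyword_stack.append(word)

def fallback_response():
    # When nothing in the current input is compelling, revisit an old topic.
    if keyword_stack:
        topic = keyword_stack.pop(random.randrange(len(keyword_stack)))
        return "Wait, earlier you mentioned %s. Tell me more about that." % topic
    return "Tell me more."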
Imagine a neural network with parsing capabilities in each node or neuron. Depending on rules and parsing results, neurons fire. If certain neurons fire, you get a good idea about topic and semantic of the question and therefore can give a good answer.
Memory is done by keeping topics talked about in a session, adding to the firing for the next question, and therefore guiding the selection process of possible answers at the end.
Keep your rules and patterns in a knowledge base, but compile them into memory at start time, with a neuron per rule. You can engineer synapses using something like listeners or event functions.
I think you can look at the code for Kooky, and IIRC it also uses Markov Chains.
Also check out the kooky quotes, they were featured on Coding Horror not long ago and some are hilarious.
I think that to start this project, it would be good to have a database of questions (organized as a tree: in every node, one or more questions).
These questions should be answered with "yes" or "no".
If the bot starts the questioning, it can start with any question from your database of questions that is marked as a start question. The answer is the path to the next node in the tree.
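A tiny sketch of such a question tree (the structure and questions are invented for illustration):

tree = {
    "question": "Does it have four legs?",
    "yes": {"question": "Does it meow?", "yes": None, "no": None},
    "no": {"question": "Does it fly?", "yes": None, "no": None},
}

node = tree
while node is not None:
    answer = input(node["question"] + " (yes/no) ").strip().lower()
    node = node.get(answer)   # unknown answers end the walk in this toy version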
Edit: Here is a simple one written in Ruby that you can start with: rubyBOT
A naive chatbot program. No parsing, no cleverness, just a training file and output.
It first trains itself on a text and then later uses the data from that training to generate responses to the interlocutor’s input. The training process creates a dictionary where each key is a word and the value is a list of all the words that follow that word sequentially anywhere in the training text. If a word features more than once in this list, then that is reflected directly, and the word is more likely to be chosen by the bot; no need for probabilistic machinery, just do it with a list.
The bot chooses a random word from your input and generates a response by choosing another random word that has been seen to be a successor to its held word. It then repeats the process by finding a successor to that word in turn and carrying on iteratively until it thinks it’s said enough. It reaches that conclusion by stopping at a word that was prior to a punctuation mark in the training text. It then returns to input mode again to let you respond, and so on.
It isn’t very realistic but I hereby challenge anyone to do better in 71 lines of code !! This is a great challenge for any budding Pythonists, and I just wish I could open the challenge to a wider audience than the small number of visitors I get to this blog. To code a bot that is always guaranteed to be grammatical must surely be closer to several hundred lines, I simplified hugely by just trying to think of the simplest rule to give the computer a mere stab at having something to say.
Its responses are rather impressionistic, to say the least! Also, you have to put what you say in single quotes.
I used War and Peace for my “corpus”, which took a couple of hours for the training run; use a shorter file if you are impatient…
Here is the trainer:
#lukebot-trainer.py
import pickle

b = open('war&peace.txt')
text = []
for line in b:
    for word in line.split():
        text.append(word)
b.close()

textset = list(set(text))
follow = {}
for l in range(len(textset)):
    working = []
    check = textset[l]
    for w in range(len(text) - 1):
        if check == text[w] and text[w][-1] not in '(),.?!':
            working.append(str(text[w + 1]))
    follow[check] = working

a = open('lexicon-luke', 'wb')
pickle.dump(follow, a, 2)
a.close()
Here is the bot:
#lukebot.py
import pickle, random

a = open('lexicon-luke', 'rb')
successorlist = pickle.load(a)
a.close()

def nextword(a):
    if a in successorlist:
        return random.choice(successorlist[a])
    else:
        return 'the'

speech = ''
while speech != 'quit':
    speech = raw_input('>')
    s = random.choice(speech.split())
    response = ''
    while True:
        neword = nextword(s)
        response += ' ' + neword
        s = neword
        if neword[-1] in ',?!.':
            break
    print response
You tend to get an uncanny feeling when it says something that seems partially to make sense.
I would suggest looking at Bayesian probabilities. Then just monitor the chat room for a period of time to create your probability tree.
I'm not sure this is what you're looking for, but there's an old program called ELIZA which could hold a conversation by taking what you said and spitting it back at you after performing some simple textual transformations.
If I remember correctly, many people were convinced that they were "talking" to a real person and had long elaborate conversations with it.
If you're just dabbling, I believe Pidgin allows you to script chat-style behavior. Part of the framework probably tracks the state of who sent the message when, and you'd want to keep a log of your bot's internal state for each of the last N messages. Future state decisions could be hardcoded based on inspection of previous states and the content of the most recent few messages. Or you could do something like the Markov chains discussed above and use them both for parsing and generating.
If you do not require a learning bot, using AIML (http://www.aiml.net/) will most likely produce the result you want, at least with respect to the bot parsing input and answering based on it.
You would reuse or create "brains" made of XML (in the AIML-format) and parse/run them in a program (parser). There are parsers made in several different languages to choose from, and as far as I can tell the code seems to be open source in most cases.
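For example, with the third-party python-aiml package (installed via pip install python-aiml; the brain file name here is made up):

import aiml

kernel = aiml.Kernel()
kernel.learn("my-bot.aiml")          # load an AIML "brain" from XML
while True:
    message = input("> ")
    if message in ("quit", "exit"):
        break
    print(kernel.respond(message))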
You can use "ChatterBot", and host it locally using - 'flask-chatterbot-master"
Links:
[ChatterBot Installation]
https://chatterbot.readthedocs.io/en/stable/setup.html
[Host Locally using - flask-chatterbot-master]: https://github.com/chamkank/flask-chatterbot