SpamAssassin: Site Wide Bayes not working? - configuration

Long ago I implemented site-wide bayes filtering as per http://wiki.apache.org/spamassassin/SiteWideBayesSetup.
I don’t think it ever worked, and I certainly find that my spam scores are always negative, with BAYES_00 suggesting that Bayes wasn’t used at all.
Here is what I have in my local.cf file:
bayes_path /etc/mail/spamassassin/bayes/bayes
bayes_file_mode 0777
When I run sa-learn I find instead that the tokens are stored in individual home directories.
What is the correct method to get this working?
Supplementary Question: if I can get this working, can I combine the various bayes_tok and other files?

If you get BAYES_00 results, then Bayes is indeed working as it has classified the email as being ham. A neutral result would be BAYES_50. You just need to train the Bayes database properly.
If sa-learn creates/updates Bayes files under your home directory, then it is either not reading the intended local.cf file, or bayes_path is being overridden by a user-specific configuration file (e.g. /root/.spamassassin/user_prefs).
You could try one of the following (a rough example follows this list):
run sa-learn under the same user account that SpamAssassin itself runs as
pass an explicit database path to sa-learn, e.g.
sa-learn --dbpath /etc/mail/spamassassin/bayes/bayes
use the -D option to see what is really going on, e.g. which configuration files are being read
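For example, a rough sketch of site-wide training and verification; the "mail" account and the mbox paths here are only placeholders for whatever your setup actually uses:
# train against the site-wide database, as the user SpamAssassin runs under
sudo -u mail sa-learn --dbpath /etc/mail/spamassassin/bayes/bayes --spam /path/to/spam.mbox
sudo -u mail sa-learn --dbpath /etc/mail/spamassassin/bayes/bayes --ham /path/to/ham.mbox
# check that the site-wide database is the one growing
sa-learn --dbpath /etc/mail/spamassassin/bayes/bayes --dump magic
# see which configuration files sa-learn actually reads
sa-learn -D --dump magic 2>&1 | grep -i config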
If/when you get it working, you generally cannot combine the various database files. There are at least a bayes_toks and a bayes_seen file: one contains the learned tokens, the other holds email Message-IDs and their associated training status (spam/ham). There can also be an optional bayes_journal if you use deferred syncing.
Further details available in the manpage for sa-learn:
https://spamassassin.apache.org/full/3.4.x/doc/sa-learn.html

The point of Stryker.Net's since.ignore-changes-in config

Stryker.Net has an option since.ignore-changes-in, and I'm trying to understand in which use cases it may be useful to ignore non-C# files [1].
The docs give this example value: ['/*Assets.json','/favicon.ico']. But if my last commit changes only Assets.json, and I run Stryker with "since": {"target": "<sha1 of my previous commit>"}, then no mutants will be found for this change, with or without this ignore-changes-in option, right?
What am I missing?
[1]: Sure, if I don't find it useful I could just avoid using it on my personal projects. However, I'm responsible for providing Stryker in CI for several teams in my company, and I'm hence interested in making sure I have a good understanding of the consequences of using it (or not).
I can think of two cases.
First, it might bring some performance gain (just by reducing the number of files Stryker has to look at), e.g. when your change updates 100 XML localizations of some string.
Second, you might have real code you don't want to run Stryker against. E.g. you have some Obsolete folder with a bunch of projects you don't care about, but which keeps getting touched by automated renames and so on. This use is mostly interchangeable with the mutate option, unless you want different filters for "normal" and "diff" Stryker executions.

IPFS CID Reproducibility

This is perhaps a very basic question, but I couldn't find an answer in my search. Assuming the file content doesn't change by even a single bit, would the IPFS CID never change, even if the file is added/pinned by different users?
Would this assumption stay true for the foreseeable future? I know it depends on the hash algorithm, so just by knowing what hash algorithm was used (SHA-256), the IPFS CID would be reproducible for the foreseeable future, right? Or is there other information that needs to be stored as well?
No, it is not safe to assume that ipfs add <file> will always give the same CID, as there are many parameters other than the hash function itself that the binary is free to change over time. At a high level, ipfs add turns a file/directory into a tree structure called UnixFS that represents that data, and since the defaults ipfs add uses to build that structure are allowed to change over time, the CID output by ipfs add example.txt can change too.
Many of the UnixFS parameters are configurable (and described in ipfs add --help) and include options such as raw leaves and chunk size. This means that if you'd really like to ensure that ipfs add example.txt always results in the same CID, there is a set of flags you can pass to make sure that is the case.
Note, in general I'd try to avoid importing the same data to IPFS multiple times (it's a waste of resources anyway) although there may be some scenarios where that's just the easiest, or best, thing to do to get your project off the ground.
As long as the default settings don't change (and the user doesn't set any specific ones to override them), all clients will generate the same CID from the same binary data.
On the other hand, there's talk of replacing SHA-256 with BLAKE2b in the future, and there are two versions of CIDs: v0 and v1.
Most custom settings require v1 of the CID.
The other things which influence the CID are (example below):
--raw-leaves
--chunker
--trickle
--inline as well as --inline-limit, when you're dealing with small files.
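Putting that together, a hedged sketch that pins the relevant options down explicitly (the values shown are only illustrative; check ipfs add --help on your version for the actual defaults):
ipfs add --cid-version=1 --hash=sha2-256 --chunker=size-262144 --raw-leaves example.txt
# --only-hash computes the CID without actually importing the data
ipfs add --only-hash --cid-version=1 --hash=sha2-256 --chunker=size-262144 --raw-leaves example.txt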
IPFS CIDs are not truly self-describing, in the sense that they don't capture information about how the data was chunked or what kind of Merkle DAG was created.
If you want the CIDs to be the same, I would suggest using content archive (CAR) files (and uploading those). CAR files carry the data already encoded with the parameters Ruben described in this answer, so different IPFS clients are forced to use those parameters even if their default settings differ.
For example, I was using web3.storage (chunking size 1MB) and Pinata (chunking size 256kB); uploading the same file to them gave different CIDs, which was unacceptable for our purposes. Using CAR files allowed us to get the same CIDs.
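A minimal sketch of producing such a CAR file with the ipfs CLI (this assumes the data has already been imported locally; dedicated tools such as ipfs-car can also build CARs straight from files):
ipfs add --cid-version=1 example.txt        # import once, note the CID it prints
ipfs dag export <CID> > example.car         # export that exact DAG as a CAR file
ipfs dag import example.car                 # the receiving side gets identical blocks, hence the identical CID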

What should I put in header comments at the top of source files?

I've got lots of source code files written in various languages, but they don't share a standard comment at the top (sometimes not even within the same project). Some of them don't have any header comment at all :-)
I've been thinking about creating a standard template that I can use at the top of my source files, and was wondering what fields I should include.
I know I want to include my name and a short description of what the file contains/does. Should I also include the date created? The date last modified? The programmer who last modified the file? What other fields have you found to be useful?
Any tips and comments welcome.
Thanks,
Cameron
This seems to be a dying practice.
Some people here on StackOverflow are against code comments altogether (reasoning that code should be written to be self-explanatory). While I wouldn't go that far, some of the points of the anti-comment crowd make sense, such as the fact that comments tend to be out of date.
Header blocks of comments suffer from these symptoms even more. In every organization I've been with that had these header blocks, they were out of date: an author name of some guy who doesn't even work there any more, a description that doesn't match the code at all (assuming it ever did), and a last-modified date that, once compared with the version control history, seems to have missed its last dozen updates.
In my personal opinion, keep comments close to the code. If you want to know the purpose and/or history of a code file, use your version control system.
Date created, date modified and author who last changed the file should be stored in your source control software.
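For example, if the project lives in git (just one possible VCS; the file path here is made up), that history is one command away:
git log --follow --oneline -- src/parser.c   # full change history of a single file
git blame src/parser.c                       # who last touched each line, and when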
I usually put:
The main purpose of the file and things within the file.
The project/module the file belongs to.
The license associated with the file (and a LICENSE file in the project root).
Who is responsible for the file (either the team, person, or both)
Back in 2002, when I was straight out of college and jobs were few and far between after the dot-com bust, I joined a service company which used to create software customized for their clients in Java. I had to sit in the office of a client (which was a ramshackle room in an electric sub-station rigged with an AC to keep the servers running), sharing chairs/PCs with other guys in the team. The other engineers (if I can call them engineers ;) in the group used to make changes ad-hoc to the source code, compile the files and put them into production.
No way to figure out who made what change.
No way to figure out why any change was made.
No way to go to previous version of code, unless the engineer "remembered" what he modified.
Backup: Copy over files from the production server, which were replaced with new files.
Location of backup: Home directory of engineer copying over files to production server.
Reports of production servers going down due to botched attempts of copying over files to the server (missed a file to be copied over, backups getting lost or wrong files being copied over or not all files being copied over) were met with shrugs (oh no, is it down? let's see what happened; hey who changed what recently...? ummm...).
During those days, after spending several frustrating days trying to figure out the whos and whys behind the code, I had devised a system for comments in a list in the header of the source file which detailed the following:
Date of change made
Who made the change
Why was the change made
Two months later when the list threatened to challenge the size of the source code in the file, the manager had the bright idea of getting a source version control system.
I have never needed to put any comments in the headers of source files (except for copyright notices) in any company I have worked at since. In my current company, everything else is mostly self-evident by looking at the code, or by going to the bug-reporting system, which is integrated with the source version control system.
What fields do you need? If you have to ask whether to put some info there, you don't really need that info. Unless you are forced to by some bureaucratic incompetence of your employer, I don't see why you should go looking for more info than you already feel should be there.
In most organizations, all source files have to begin with a legal blurb. If you're really lucky, it's just a one-liner, but in most cases it's a really long block of legalese. As a result, few people ever read these. Our eye just travels to the first program element and then goes up to its documentation.
So if you want to write anything, write it in association with the topmost program element, not the file.
Any other bookkeeping information should generally be part of your version control, not maintained (poorly) in the file itself.
In addition to the comments above about stating the license, the project the file belongs to, etc., I also tend to put the "weird" requirements at the top (such as "built with version X of library Y"), so that you, or the person who picks it up after you, won't change something the program relies on without realizing it (or, if they do, they will at least know what to change back).
A lot depends on whether you're using an auto-documentation generation tool or not.
While I agree with many of the comments, if you're using JavaDoc or some other documentation generating tool that depends on comments, you'll obviously need to include the things it wants to see.
You did not mention whether you are using a version control system, and your comment on Neil N's answer confirms you are not for your older code. While using version control is the best way to go, I have also experienced many situations where the cost of doing so for older code would not be paid for by the project's sponsor. If you do not have a centralized change history for the project, then the change history can be kept in the modules. It is good that you are using a version control system for your new code.
Your company name
All rights reserved (c) year - or reference to appropriate license
Project or library this file is for
Module it belongs to
Description of what it contains
History
-------
01/08/2010 - Programmer - version
Initial creation.
01/09/2010 - Programmer - version
Change description.
01/10/2010 - Programmer - version
Change description.
Those useful fields that you mentioned are good ones. Who modified the file and when.
Your version control software should allow for the embedding of keywords within comments. For example, in CVS, the $Id$ will resolve to the file, date/time modified, and user that modified the file. It will automatically be kept up to date with each check-in.
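For instance, with keyword expansion the header can shrink to something like this (CVS expands the keywords on checkout/commit; Subversion does too, but only if the svn:keywords property is set):
# $Id$
# $Revision$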
Include the following information:
What this file is for. That's a very useful piece of knowledge and it's more important than anything else. You should tell the reader why this file exists, why you grouped these functions into a separate file/package/module, and how they are meant to be used. Maybe briefly, one or two lines, but it should be there.
Legal stuff, if applicable.
Leave room for the special commands of console editors, such as Emacs (see the example after this list).
Add special commands that your auto-documenting system requires.
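To make the last two points concrete, a hedged sketch of the top of a shell script: the second line is an Emacs file-variables line (Emacs reads it from the first line, or the second when the first is a shebang), and the third is the vim modeline equivalent. The exact options are only examples.
#!/bin/sh
# -*- mode: sh; coding: utf-8; indent-tabs-mode: nil -*-
# vim: set ts=4 sw=4 et: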
Things you should not include are:
Who created the file
When it was created
Who modified it the last time
When it was last modified
What was added by the latest modification
You can (and should) retrieve all of that via the version control system, where it is automatically kept up to date. Besides, most of these points are just useless.

How do web spiders differ from Wget's spider?

The following sentence in Wget's manual caught my eye:
wget --spider --force-html -i bookmarks.html
This feature needs much more work for Wget to get close to the functionality of real web spiders.
I find the following lines of code relevant to the spider option in wget:
src/ftp.c
780: /* If we're in spider mode, don't really retrieve anything. The
784: if (opt.spider)
889: if (!(cmd & (DO_LIST | DO_RETR)) || (opt.spider && !(cmd & DO_LIST)))
1227: if (!opt.spider)
1239: if (!opt.spider)
1268: else if (!opt.spider)
1827: if (opt.htmlify && !opt.spider)
src/http.c
64:#include "spider.h"
2405: /* Skip preliminary HEAD request if we're not in spider mode AND
2407: if (!opt.spider
2428: if (opt.spider && !got_head)
2456: /* Default document type is empty. However, if spider mode is
2570: * spider mode. */
2571: else if (opt.spider)
2661: if (opt.spider)
src/res.c
543: int saved_sp_val = opt.spider;
548: opt.spider = false;
551: opt.spider = saved_sp_val;
src/spider.c
1:/* Keep track of visited URLs in spider mode.
37:#include "spider.h"
49:spider_cleanup (void)
src/spider.h
1:/* Declarations for spider.c
src/recur.c
52:#include "spider.h"
279: if (opt.spider)
366: || opt.spider /* opt.recursive is implicitely true */
370: (otherwise unneeded because of --spider or rejected by -R)
375: (opt.spider ? "--spider" :
378: (opt.delete_after || opt.spider
440: if (opt.spider)
src/options.h
62: bool spider; /* Is Wget in spider mode? */
src/init.c
238: { "spider", &opt.spider, cmd_boolean },
src/main.c
56:#include "spider.h"
238: { "spider", 0, OPT_BOOLEAN, "spider", -1 },
435: --spider don't download anything.\n"),
1045: if (opt.recursive && opt.spider)
I would like to see the differences in code, not abstractly. I love code examples.
How do web spiders differ from Wget's spider in code?
A real spider is a lot of work
Writing a spider for the whole WWW is quite a task --- you have to take care of many "little details" such as:
Each spider computer should receive data from a few thousand servers in parallel in order to make efficient use of the connection bandwidth. (asynchronous socket i/o).
You need several computers that spider in parallel in order to cover the vast amount of information on the WWW (clustering; partitioning the work)
You need to be polite to the spidered web sites:
Respect the robots.txt files.
Don't fetch a lot of information too quickly: this overloads the servers.
Don't fetch files that you really don't need (e.g. iso disk images; tgz packages for software download...).
You have to deal with cookies/session ids: Many sites attach unique session ids to URLs to identify client sessions. Each time you arrive at the site, you get a new session id and a new virtual world of pages (with the same content). Because of such problems, early search engines ignored dynamic content. Modern search engines have learned what the problems are and how to deal with them.
You have to detect and ignore troublesome data: connections that provide a seemingly infinite amount of data or connections that are too slow to finish.
Besides following links, you may want to parse sitemaps to get URLs of pages.
You may want to evaluate which information is important to you and changes frequently, so that it can be refreshed more often than other pages. Note: A spider for the whole WWW receives a lot of data --- you pay for that bandwidth. You may want to use HTTP HEAD requests to guess whether a page has changed or not (see the example after this list).
Besides receiving, you want to process the information and store it. Google builds indices that list for each word the pages that contain it. You may need separate storage computers and an infrastructure to connect them. Traditional relational data bases don't keep up with the data volume and performance requirements of storing/indexing the whole WWW.
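As a rough illustration of the sitemap and HEAD-request points above (the URLs are placeholders), both can be done with plain HTTP requests:
# robots.txt often advertises the sitemap location
curl -s https://example.com/robots.txt | grep -i '^sitemap:'
# a HEAD request returns headers only, enough to guess whether a page has changed
curl -sI https://example.com/some/page | grep -iE '^(last-modified|etag):'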
This is a lot of work. But if your target is more modest than reading the whole WWW, you may skip some of the parts. If you just want to download a copy of a wiki etc. you get down to the specs of wget.
Note: If you don't believe that it's so much work, you may want to read up on how Google re-invented most of the computing wheels (on top of the basic Linux kernel) to build good spiders. Even if you cut a lot of corners, it's a lot of work.
Let me add a few more technical remarks on three points
Parallel connections / asynchronous socket communication
You could run several spider programs in parallel processes or threads. But you need about 5000-10000 parallel connections in order to make good use of your network connection. And this amount of parallel processes/threads produces too much overhead.
A better solution is asynchronous input/output: process about 1000 parallel connections in one single thread by opening the sockets in non-blocking mode and using epoll or select to process just those connections that have received data. Since kernel 2.4, Linux has had excellent support for this kind of scalability (I also recommend that you study memory-mapped files), continuously improved in later versions.
Note: Using asynchronous I/O helps much more than using a "fast language": it's better to write an epoll-driven process handling 1000 connections in Perl than to run 1000 processes written in C. If you do it right, you can saturate a 100Mb connection with processes written in Perl.
From the original answer:
The down side of this approach is that you will have to implement the HTTP specification yourself in an asynchronous form (I am not aware of a re-usable library that does this). It's much easier to do this with the simpler HTTP/1.0 protocol than the modern HTTP/1.1 protocol. You probably would not benefit from the advantages of HTTP/1.1 for normal browsers anyhow, so this may be a good place to save some development costs.
Edit five years later:
Today, there is a lot of free/open source technology available to help you with this work. I personally like the asynchronous HTTP implementation of node.js --- it saves you all the work mentioned in the original paragraph above. Of course, today there are also a lot of modules readily available for the other components that you need in your spider. Note, however, that the quality of third-party modules may vary considerably. You have to check out whatever you use. [Aging info:] Recently, I wrote a spider using node.js and I found the reliability of npm modules for HTML processing for link and data extraction insufficient. For this job, I "outsourced" the processing to a process written in another programming language. But things are changing quickly, and by the time you read this comment, this problem may already be a thing of the past...
Partitioning the work over several servers
One computer can't keep up with spidering the whole WWW. You need to distribute your work over several servers and exchange information between them. I suggest assigning certain "ranges of domain names" to each server: keep a central database of domain names with a reference to a spider computer.
Extract URLs from received web pages in batches: sort them according to their domain names; remove duplicates and send them to the responsible spider computer. On that computer, keep an index of URLs that already are fetched and fetch the remaining URLs.
If you keep a queue of URLs waiting to be fetched on each spider computer, you will have no performance bottlenecks. But it's quite a lot of programming to implement this.
Read the standards
I mentioned several standards (HTTP/1.x, Robots.txt, Cookies). Take your time to read them and implement them. If you just follow examples of sites that you know, you will make mistakes (forget parts of the standard that are not relevant to your samples) and cause trouble for those sites that use these additional features.
It's a pain to read the HTTP/1.1 standard document. But all the little details got added to it because somebody really needs that little detail and now uses it.
I am not sure exactly what the original author of the comment was referring to, but I can guess that wget is slow as a spider, since it appears to only use a single thread of execution (at least by what you have shown).
"Real" spiders such as heritrix use a lot of parallelism and tricks to optimize their crawling speed, while simultaneously being nice to the website they are crawling. This typically means limiting hits to one site at a rate of 1 per second (or so), and crawling multiple websites at the same time.
Again this is all just a guess based on what I know of spiders in general, and what you posted here.
Unfortunately, many of the more well-known 'real' web spiders are closed-source, and indeed closed-binary. However there are a number of basic techniques wget is missing:
Parallelism; you're never going to be able to keep up with the entire web without retrieving multiple pages at a time
Prioritization; some pages are more important to spider than others
Rate limiting; you'll be banned quickly if you keep pulling down pages as fast as you can (see the wget example at the end of this answer)
Saving to something other than a local filesystem; the Web is big enough that it's not going to fit in a single directory tree
Rechecking pages periodically without restarting the entire process; in practice, with a real spider you'll want to recheck 'important' pages for updates frequently, while less interesting pages can go for months.
There are also various other inputs that can be used such as sitemaps and the like. Point is, wget isn't designed to spider the entire web, and it's not really a thing that can be captured in a small code sample, as it's a problem of the whole overall technique being used, rather than any single small subroutine being wrong for the task.
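That said, for politely mirroring a single site rather than spidering the web, wget itself has the relevant knobs; a hedged sketch (the values and URL are only illustrative):
wget --spider --recursive --level=3 --no-parent \
     --wait=1 --random-wait --limit-rate=200k \
     https://example.com/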
I'm not going to go into the details of how to spider the whole internet; I think that wget comment refers to spidering a single website, which is still a serious challenge.
As a spider you need to figure out when to stop, and not go into recursive crawls just because a URL changed, like date=1/1/1900 to 1/2/1900 and so on.
An even bigger challenge is sorting out URL rewriting (I have no clue whatsoever how Google or anyone else handles this). It's a pretty big challenge to crawl enough but not too much. And how can one automatically recognise URL rewriting with some random parameters and random changes in the content?
You need to parse Flash/JavaScript, at least to some level.
You need to consider some crazy HTTP issues like the base tag. Even parsing the HTML is not easy, considering most websites are not XHTML and browsers are very flexible about the syntax.
I don't know how many of these are implemented or considered in wget, but you might want to take a look at httrack to understand the challenges of this task.
I'd love to give you some code examples, but this is a big task and a decent spider will be about 5000 LOC without third-party libraries.
Some of these were already explained by yaakov-belch, so I'm not going to type them again.

Change config values on a specific time

I just got a mail saying that I have to change a config value on 2009-09-01 (new taxes). Our normal approach for this would be to stay awake until 23:59 on 2009-08-31 and then just change the value manually. That's not a big problem, since it doesn't happen too often. But it makes me wonder how other people handle issues like this.
So! How do you handle date specific config changes?
(We are working in asp.net but I don't think this has to be language specific)
Br
Carl Bergquist
I'd normally store this kind of data in a database table like this
Key, Value, EffectiveFrom, EffectiveTo
-----------------------------------------
VAT, 15.0, 20081201, 20091231
VAT, 17.5, 20100101, NULL
I'd then use the EffectiveFrom and EffectiveTo dates to choose the value that is effective at the given time. If the rate is open-ended, then EffectiveTo could be either NULL or 99991231.
This also allows you to go back without having to change the config. E.g. if someone asks you to recalculate the tax for the previous month before the rate change.
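The lookup is then a single dated query. A rough sketch using sqlite3 purely for illustration (the Config table name is made up; the columns follow the example above, and your real store will differ):
sqlite3 rates.db "SELECT Value FROM Config
                  WHERE Key = 'VAT'
                    AND EffectiveFrom <= 20090901
                    AND (EffectiveTo IS NULL OR EffectiveTo >= 20090901)"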
In Linux, there is a command, at, for running one-off jobs at a specified time.
See man at for details.
To be honest, waking up near the time and changing it seems to be the simplest and cheapest approach. All of the technical solutions are fine, but it depends where you work.
In our environment it would be cheaper and simpler to get someone to wake up and make the change than to redevelop the functionality of a piece of software that already works. It certainly involves less testing, development overhead and costs which means we would tend to solve the problem as you do, manually.
That depends totally on the situation and the technology.
pjp's idea is good, if you get your config from a database, or as metadata to define the valid time for whole config sets/files.
Another might be to just prepare a new config file with the new entries and swap the files at midnight (probably with a restart of the service/program).
Swapping them would be possible with at (as given by Neeraj)...
If timing is a problem, you should handle the change, or at least the timing of the change, on the running server (to avoid time-out-of-sync problems).
We had the same kind of problem some time ago and handled it using the following approach.
This is suitable if you know the source that originates the configuration changes well.
In our case, the source (actually a third party) exposed a web service which returns the modified config details, and there is a Windows service running on our server which keeps polling the web service and updates the configuration file if there is any change.
This works perfectly in our case.
You can make use of this approach by pointing the polling part at your own source of config changes (say, reading changes from some disk path). But I am not sure how this would work if the changes only arrive by email.
Why not just make a shell script to swap out the files? Run it from cron so it switches the files a minute before the deadline, and have it send an alert text if it is NOT successful and an email if it is.
This is an example on a Linux box but I think you get the point and can do this on a Windows box.
Script:
#!/bin/sh
# back up the live config, then copy the new one into place
cp /path/to/old/config /path/to/backup/dir/config.$(date +%Y%m%d%H%M%S)
if cp /path/to/new/config /path/to/old/config; then
    sendSuccessEmail      # placeholder for however you send mail
else
    sendPanicTextAlert    # placeholder for however you page yourself
fi
cron:
59 23 31 8 * /path/to/script.sh
You could test this beforehand as well; just point it at some dummy directories and files.
I've seen the hybrid approach. Instead of actually changing the data model to include EffectiveDate/EndDate or manually changing the values yourself, schedule a script to change the values automatically. Also, be sure to have a solid test plan that will validate all changes.
However, this type of manual change can have a dramatic impact on reporting. If previous transactions join directly to the tables being changed, numbers in historical reports could change in a very bad way. There really is no "right" answer.
If I'm not able to do something like pjp's solution, I'd use either a scheduled task or a server job to update it automatically at the right time.
But...I'd probably still be awake checking it had worked.
Look, the best solution would be to parameterise your config file and add things like the date from which a certain entry should be used. That would negate the need for any copying or swapping of files, and your application would simply deal with it. (That goes for a config file approach or a database.)
If you cannot change the current systems and you have to go with swapping the config files, then you also have two options:
Use a scheduled task to kick off a batch job, or even a VBScript or PowerShell script (whichever you feel comfortable with). Make sure you set up the correct credentials to be able to do this in the middle of the night, and you could also add some checking and mitigation to this approach.
Write a Windows service that does this for you. Here you have all the flexibility you need: code it to do whatever it needs to do, do all the checks you need (so that you can keep sleeping rather than making sure it actually worked), etc. Your service would then even take care of the scheduling aspect and all will be good. Here you could use an XML DOM object and XPath and, instead of replacing the file, simply update the specific entries as required.
Remember that any change to the config file will cause your site to restart, so make sure you take care of all the other housekeeping this could cause. (Although this would be exactly the same if you were sitting there in the middle of the night copying files around.)