What is spam?

There are a number of definitions of spam, but basically, it's unsolicited commercial email (UCE). It's an advertisement you didn't ask for, and it's the email equivalent of a telemarketing fax or phone call.

Why should I care?

If you only get one or two pieces of spam every now and then, well, you probably don't need to care. But some people get literally hundreds of them a day. Imagine coming in after a weekend and finding you have hundreds of emails, and having to sift through them all to see if there are any legitimate ones.

Or, if you have a BlackBerry or similar device, imagine it going off every time you receive an emailed ad for a porn site, an online pharmacy, or whatever. Imagine it doing that several times an hour, 24 hours a day, 7 days a week.

How do they get my address?

There are lots of ways. Here are just a few.

You registered a domain name

When you register a domain name, you have to provide contact information for at least one contact, and this includes an email address. Some registries make an effort to hide this info, but it's essentially public information. If you tell me your domain name, I can look it up and probably tell you the email addresses that were used to register it. Spammers do this.

Your Web site lists your email address

Just as search engines like Google crawl the Web to index the contents of Web pages, the spammers have crawlers which go around harvesting email addresses.

You posted to a newsgroup or discussion forum

The spammers crawl these, too, and harvest any email addresses on them. This is why you'll often see people posting obviously fake email addresses, or spelling their address out like "joe underscore smith at whatever dot com", so that a human can easily understand the address but it doesn't look like the email addresses a crawler is looking for.
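
To make the idea concrete, here is a minimal sketch of why a spelled-out address slips past a harvester. The pattern below is an assumption for illustration; real crawlers vary, and some may well look for spelled-out forms too.

```python
import re

# Simplified pattern for the kind of address a crawler might look for.
# This exact pattern is an assumption, not taken from any real harvester.
ADDRESS_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def harvest(text):
    """Return every substring of text that looks like an email address."""
    return ADDRESS_PATTERN.findall(text)

plain = "Contact me at joe_smith@whatever.com for details."
spelled_out = "Contact me at joe underscore smith at whatever dot com for details."

print(harvest(plain))        # the ordinary address is picked up
print(harvest(spelled_out))  # the spelled-out form yields nothing
```

A person reads both sentences the same way, but only the first contains anything this pattern can match.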

One solution to this is to get a second email address (e.g. a Hotmail account), and use that address for these posts. That way, when the spammers harvest the address and start sending to it, the spam won't end up in your main mailbox. If the volume of spam gets too high, just sign up for another Hotmail account and stop using the first one.

You typed your email address on a form on a Web site

Lots of Web sites ask for your email address. A lot of these are legitimate businesses who want to send you their newsletter or a status update on the product you ordered. But not all of them, and it may be hard to tell sometimes. This is another good reason to use a second email address.

Your email address is easy to guess

Some spammers use dictionary attacks. If your domain is whatever.com, they'll try hundreds, even thousands, of common names @whatever.com: andy@whatever.com, bob@whatever.com, and so on. They'll also try combinations of names and initials. Most of them will fail, but it doesn't take long to try them, and some will probably get through. So if your email address is just your name, or name and initials, and if your name is even remotely common, you could start getting spam without ever once posting your email address anywhere public.

They tracked you

You got a piece of spam from them once, and you clicked on the "Click here to remove me" link. Now you get more spam, because the remove-me link doesn't remove you; instead, it confirms that they have your correct email address, and that you're a real live human being who actually reads the spam they send you. So don't ever click on one of those links, or reply to a piece of spam.

You have spyware on your PC and it told them your email address.

You have your mail software configured to download any images that are referenced in the email. Lots of mail software does this, but it can be used to confirm to the spammers that they have your real address and that you read their emails. They simply put a unique tracking ID in each email, and embed that tracking ID in the link to the picture. When your mail client contacts their Web server to get the image, it sends the tracking ID, and they know you read their email.
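
The mechanism can be sketched roughly as follows. The host name images.example.com, the "id" parameter, and both function names are hypothetical, chosen only to illustrate how a unique ID in an image link ties a download back to one recipient.

```python
import uuid
from urllib.parse import urlencode, urlparse, parse_qs

def make_tracked_image_url(recipient_ids, recipient):
    """Create a unique image URL for one recipient and remember who it belongs to."""
    tracking_id = uuid.uuid4().hex
    recipient_ids[tracking_id] = recipient
    # Hypothetical host and parameter name.
    return "http://images.example.com/pixel.gif?" + urlencode({"id": tracking_id})

def recipient_for_request(recipient_ids, requested_url):
    """Given the URL a mail client requested, work out which recipient opened the email."""
    query = parse_qs(urlparse(requested_url).query)
    tracking_id = query.get("id", [None])[0]
    return recipient_ids.get(tracking_id)

ids = {}
url = make_tracked_image_url(ids, "joe@whatever.com")
print(recipient_for_request(ids, url))
```

This is why turning off automatic image downloading helps: if the image is never requested, the ID is never sent back.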

Why do I get ads for stuff unrelated to me?

Spammers don't care. They don't care if you want spam. They don't care if you read it. They don't care if they're sending you ads which are relevant to you. Why not? Because the cost of sending spam is extremely low; it's almost free. It's cheaper to send out a million messages offering to make a gender-specific body part larger, even though half of them will reach people who lack that body part, than it is to try to filter the list to get only people of that gender.

How can I stop it?

You can't stop it, short of entirely cutting yourself off from email. There are no filters which stop 100% of spam without also getting in the way of ham (ham is a term the anti-spam community uses to refer to legitimate email), and in general, the higher the rate at which you stop spam, the more ham you accidentally block as well. And some of the spammers are quite clever; they will craft messages in an attempt to slip them past anti-spam filters. Some spammers even set up their own anti-spam filters and use them to test their own messages to see how likely they are to be rejected.

Anti-spam packages divide broadly into two categories: those targeted at end users, and those targeted at mail servers. If you're looking for an end-user package, you could look at your Internet security or anti-virus program and see if it, or a newer version of it, offers anti-spam capabilities. There are also a number of anti-spam products on the market.

At the mail server level, the goal is to protect an entire organization from spam. There are some anti-spam products which are separate from your mail server and scan mail before it reaches the server; there are others which are specific to the type of mail server you're using (e.g. Lotus Domino or Microsoft Exchange) and integrate into it. They tend to use many of the same techniques, which include:

IP blacklists

Spammers sometimes send all the spam themselves. Most of the time, however, they find other machines on the Internet which are insecure and which can be used to relay the spam; this gets the spam out more quickly, as the work is spread among multiple servers, and also makes it harder to track the true source of the email. Some of these machines are mail servers which aren't configured properly, and others are simply users' desktop PCs which have been infected by worms or spyware. Some worms/spyware allow the bad guys to use your machine to send emails; these machines are called zombies.

In either case, however, there are large volumes of spam coming from a relatively small number of machines. Each machine has a unique IP address, and there are a number of public blacklists which list machines which send spam. Some of them simply list any machine which is sending spam, while some of them list machines which have been tested and found to be insecure (so while they may or may not be sending spam right now, they're open and could be used by a spammer at any moment).

The anti-spam product looks at the IP address which is trying to send it email, and checks that address in whatever blacklists it's configured to use. There are dozens of blacklists, with different listing policies, and the administrator has to choose carefully. Some, for instance, are very aggressive, and may list an entire ISP if that ISP has a handful of customers which send spam. But any large ISP is likely to have a number of spammers as customers, so such blacklists tend to block most major ISPs; clearly, if you're running a business, you can't use a list like that. Each blacklist typically has a Web page stating its listing criteria, and you should read those criteria and decide whether they're reasonable for your intended use or not.
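
As a rough sketch of the check itself, assume the blacklist is simply a local list of addresses and address ranges. Real blacklists are normally queried over the network; the local list, the entries in it (taken from ranges reserved for documentation), and the function name are all assumptions for illustration.

```python
import ipaddress

# Hypothetical blacklist entries.
BLACKLIST = [
    ipaddress.ip_network("192.0.2.15/32"),    # a single machine seen sending spam
    ipaddress.ip_network("198.51.100.0/24"),  # a whole provider's range (an aggressive listing)
]

def is_blacklisted(sender_ip, blacklist=BLACKLIST):
    """Return True if the connecting IP address falls inside any listed entry."""
    address = ipaddress.ip_address(sender_ip)
    return any(address in network for network in blacklist)

print(is_blacklisted("192.0.2.15"))     # the listed machine
print(is_blacklisted("198.51.100.77"))  # an innocent customer of the listed provider
print(is_blacklisted("203.0.113.5"))    # not listed anywhere
```

The second lookup shows the risk of an aggressive list: every customer in the listed range is blocked, whether or not they ever sent spam.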

Content filtering rules

These rules look for words or phrases which commonly appear in spam. For example, Viagra is commonly mentioned, so there might be a rule which looks for that word. But Viagra is also commonly mentioned in jokes, and it might also be legitimately mentioned in private email. This is an example of the fine line between ham and spam; if you block every email that has the word Viagra in it, you will indeed block a lot of spam, but you'll also block some ham.

Often, these rules are updated in a fashion similar to how virus definitions are updated. If a new piece of spam comes out and it includes the phrase "Buy your Viagra from our online pharmacy", that phrase quickly gets added to the rules. This phrase is unlikely to appear in ham, so while this rule won't block as much spam as simply looking for the word "Viagra", it has the advantage of blocking little or no ham.

Most content filtering solutions which include definition updates also include generic rules, so that if you receive a new spam before it makes it into the latest definitions, it may still trigger the generic rules (in the above example, the ad will still trigger the generic "Viagra" rule).
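
Here is a minimal sketch of how specific and generic rules might sit side by side. The rule lists and the function name are assumptions; real products use far larger rule sets, usually with weights attached.

```python
import re

# Hypothetical rule sets: specific phrases arrive via definition updates,
# generic words are always present.
SPECIFIC_PHRASES = ["buy your viagra from our online pharmacy"]
GENERIC_WORDS = ["viagra"]

def match_rules(message):
    """Return a list of (rule_type, rule_text) pairs that the message triggers."""
    text = message.lower()
    hits = []
    for phrase in SPECIFIC_PHRASES:
        if phrase in text:
            hits.append(("specific", phrase))
    for word in GENERIC_WORDS:
        # \b keeps the rule from firing inside a longer, unrelated word.
        if re.search(r"\b" + re.escape(word) + r"\b", text):
            hits.append(("generic", word))
    return hits

print(match_rules("Buy your Viagra from our online pharmacy today!"))
print(match_rules("Heard a good Viagra joke at lunch"))
```

The ad triggers both rules, while the joke triggers only the generic one, which is why a product might treat a generic hit as weaker evidence than a specific hit.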

Fingerprinting

A spammer, sending messages to millions of recipients, doesn't have time to write millions of different messages; each recipient will get a message which is very similar to, if not exactly the same as, what the other recipients get. So the anti-spam vendors create fingerprints for each spam message they see. The anti-spam program on your server creates a fingerprint and checks it against the database of fingerprints that the vendor has produced, and if it finds a match, the message is considered spam.
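
One simple way to picture a fingerprint is a hash of the message after trivial differences have been stripped out. The normalization step and the choice of SHA-256 below are assumptions; vendors do not generally publish their methods, and real fingerprints are likely more tolerant of small changes than this sketch is.

```python
import hashlib
import re

def fingerprint(message):
    """Hash the message after removing case, digits, punctuation and extra spacing."""
    normalized = re.sub(r"[^a-z]+", " ", message.lower()).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# Hypothetical vendor database, seeded with one known spam message.
KNOWN_SPAM_FINGERPRINTS = {fingerprint("Buy your Viagra from our online pharmacy!")}

def is_known_spam(message):
    """Return True if the message's fingerprint is in the vendor database."""
    return fingerprint(message) in KNOWN_SPAM_FINGERPRINTS

print(is_known_spam("BUY your viagra   from our online pharmacy!!! 12345"))
print(is_known_spam("Lunch on Friday?"))
```

The first message differs from the known spam only in case, spacing, and some trailing digits, so it produces the same fingerprint.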

Bayesian classification

A Bayesian classifier examines every word in every message and builds a database of how often those words appear in ham and how often they appear in spam. Every word in every new message is checked against the database.

One popular classifier gives each word a probability ranging from 0 (has been seen in ham but never in spam) to 1 (has been seen in spam but never in ham). Words which appear frequently in both ham and spam will end up with scores around 0.5. When a new message comes in, it checks all the words in the message against the database, and throws out all but the ten words whose scores are farthest from 0.5 (so it automatically ignores any words which aren't useful in determining whether a message is ham or spam). It takes those scores and runs them through a mathematical formula to come out with an overall score. If most or all of those ten words are spammy, the spam score will be very high. But if most of them are hammy and there's just one spammy word (which might be the case, for instance, if your buddy emails you a joke about Viagra), the spam score will be quite low.
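
The scoring step might look something like the sketch below. The text does not say which formula is used, so the combination here (a common naive-Bayes-style product) is an assumption, as are the function name, the example scores, the handling of unknown words, and the clamping of scores to the range 0.01 to 0.99 to avoid dividing by zero.

```python
def spam_score(message_words, word_scores, keep=10):
    """Combine the scores of the most telling words into one overall spam score."""
    # Look at each distinct word once; skip words the database has never seen
    # (an assumption: a classifier could also give them a neutral score).
    unique_words = list(dict.fromkeys(message_words))
    known = [word_scores[w] for w in unique_words if w in word_scores]
    # Keep only the `keep` scores farthest from the neutral value 0.5.
    most_telling = sorted(known, key=lambda p: abs(p - 0.5), reverse=True)[:keep]
    if not most_telling:
        return 0.5  # nothing to go on
    spam_product = 1.0
    ham_product = 1.0
    for p in most_telling:
        p = min(max(p, 0.01), 0.99)  # clamp so neither product can reach zero
        spam_product *= p
        ham_product *= 1.0 - p
    return spam_product / (spam_product + ham_product)

# Hypothetical database entries.
scores = {"viagra": 0.9, "pharmacy": 0.95, "discount": 0.9,
          "meeting": 0.05, "lunch": 0.1, "budget": 0.05, "the": 0.5}

print(spam_score(["the", "lunch", "meeting", "budget", "viagra"], scores))  # a joke from a buddy
print(spam_score(["discount", "viagra", "pharmacy"], scores))               # an ad
```

With these made-up scores, the joke comes out well below 0.1 despite its one spammy word, while the ad comes out above 0.9.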

Bayesian classifiers are very powerful in that they can adapt to your email. For instance, if you're a physician, Viagra may appear quite frequently in your ham, and of course it will appear quite frequently in your spam, so the classifier will learn that this word is not useful in classifying your email. But if you're a banker, chances are your ham rarely mentions Viagra, but your spam often does, and the classifier will learn that the presence of that word in your email strongly suggests that it's spam.

Bayesian classifiers can also adapt to the spammers' attempts to avoid them. There are lots of ways to write something which the brain will interpret as Viagra; for instance, V1agra, Wiagra, and Viaqra all look close enough that the brain automatically corrects them as you read. A hard-coded rule to look for the word Viagra will not catch these. But a Bayesian classifier usually will. The first couple of times the word shows up, there is no score for it in the database, but chances are that the other words in the message will result in it being flagged as spam. So the classifier learns that the word "V1agra" only ever shows up in spam, and it very quickly becomes able to use that new word to detect spam. In fact, since "Viagra" sometimes shows up in legitimate email but "V1agra" never does, "V1agra" ends up with a spam score of 1, the maximum, which is even higher than the score for the real word "Viagra".
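
The learning side can be sketched as a pair of counters. The class name, its methods, and the way the score is computed (the word's rate in spam divided by its combined rate in spam and ham) are assumptions made for illustration, not a description of any particular product.

```python
from collections import Counter

class WordDatabase:
    """Tracks how many spam and ham messages each word has appeared in."""

    def __init__(self):
        self.spam_counts = Counter()
        self.ham_counts = Counter()
        self.spam_messages = 0
        self.ham_messages = 0

    def learn(self, words, is_spam):
        """Record one classified message; each word counts once per message."""
        unique = set(words)
        if is_spam:
            self.spam_counts.update(unique)
            self.spam_messages += 1
        else:
            self.ham_counts.update(unique)
            self.ham_messages += 1

    def score(self, word):
        """Return a score from 0 (ham only) to 1 (spam only), or None if never seen."""
        spam_rate = self.spam_counts[word] / self.spam_messages if self.spam_messages else 0.0
        ham_rate = self.ham_counts[word] / self.ham_messages if self.ham_messages else 0.0
        if spam_rate + ham_rate == 0:
            return None
        return spam_rate / (spam_rate + ham_rate)

db = WordDatabase()
db.learn(["cheap", "v1agra"], is_spam=True)
db.learn(["cheap", "viagra"], is_spam=True)
db.learn(["funny", "viagra", "joke"], is_spam=False)
db.learn(["lunch", "friday"], is_spam=False)

print(db.score("v1agra"), db.score("viagra"))
```

Because "v1agra" has only ever been seen in spam, it scores 1, while "viagra", seen in both kinds of mail, lands in the middle.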

Whitelists

Whitelists allow the administrator to tailor the system's operation. Let's say your biggest customer happens to use an ISP which frequently gets listed on IP blacklists as being a spammer. You want to use the blacklists, because they can get rid of the majority of spam, but you don't want to block your biggest customer. You can whitelist their domain name, and then any emails coming from their domain will automatically bypass spam filtering.
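
In code, a domain whitelist amounts to a check that runs before any other test. The domain bigcustomer.example and the function names below are hypothetical.

```python
# Hypothetical whitelist maintained by the administrator.
WHITELISTED_DOMAINS = {"bigcustomer.example"}

def sender_domain(address):
    """Return the lowercased domain part of an email address."""
    return address.rsplit("@", 1)[-1].strip().lower()

def should_filter(sender_address, whitelist=WHITELISTED_DOMAINS):
    """Return False if the sender's domain is whitelisted and spam filtering should be skipped."""
    return sender_domain(sender_address) not in whitelist

print(should_filter("buyer@BigCustomer.example"))   # whitelisted: bypasses filtering
print(should_filter("someone@elsewhere.example"))   # everyone else is filtered as usual
```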

User tools for email classification

Some anti-spam products put buttons in the user's mail program to let the user classify emails as spam or ham. This is useful in a couple of ways.

One is for a Bayesian classifier. The classifier can only update its database if it knows whether a particular message is ham or spam. Sometimes it can determine this based on the other tests it uses, and then it can automatically update the database. But sometimes, the other tests are inconclusive. If the user clicks a button to say "This message is spam" or "This message is not spam" then the database can be updated accordingly.

The other is to fix mistakes. If you've used Hotmail's "This is not spam" feature before, you'll know what it does when you click it: it adds the sender to your list of people who send you legitimate email (a.k.a. a whitelist).

Should I block spam or just tag it?

Often, it makes more sense to tag it and have it automatically filed into a "junk mail" folder in the user's mailbox, because anti-spam filtering is not perfect. If the suspected spam gets filed in a junk mail folder, the user can look through it once in a while and make sure there's nothing legitimate in there, but if the mail was rejected outright, the user never gets that chance.

Some anti-spam programs can block some mail and flag other mail. For instance, if a message matches the fingerprint of a widespread spam message, it's pretty safe to block it. Depending on what blacklists you choose, it may also be pretty safe to block messages which trigger the blacklists, too. If it gets past those checks, but gets flagged by another test like the Bayesian classifier or content filtering, maybe you want to let the message through but flag it, just to be on the safe side.
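
A policy like the one just described could be written down roughly as follows. The function name, its parameters, and the 0.9 threshold are assumptions; the right choices depend on which tests you trust and how much risk your users will accept.

```python
def decide(matches_fingerprint, on_blacklist, bayes_score, content_hits, tag_threshold=0.9):
    """Return "block", "tag", or "deliver" for one incoming message."""
    # High-confidence tests: reject outright.
    if matches_fingerprint or on_blacklist:
        return "block"
    # Less certain tests: let the message through, but flag it for the junk mail folder.
    if bayes_score >= tag_threshold or content_hits > 0:
        return "tag"
    return "deliver"

print(decide(True, False, 0.2, 0))    # known widespread spam
print(decide(False, False, 0.95, 0))  # only the Bayesian classifier is suspicious
print(decide(False, False, 0.1, 0))   # nothing triggered
```

Moving a test from the first group to the second trades a fuller junk mail folder for a smaller chance of losing a legitimate message.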

It is important to consult with your user community, and with management, when making these sorts of decisions. Not everyone has the same tolerance for blocking the occasional piece of ham; some people and businesses can live with it, while others would find even a single blocked piece of ham unacceptable and will put up with more spam in their mailbox in order to avoid blocking ham.
