Open Source Filtering

Tuesday, April 01 2003 @ 12:06 PM UTC
Contributed by: cmurdock
(Originally posted on opensourceschools.org)

By Cindy Murdock
Network Administrator
Meadville Public Library/Crawford County Federated Library System

Because of the recent regulations in the U.S. resulting from the passage of the Children's Internet Protection Act, schools and libraries find themselves in the same circumstances with regard to filtering. In brief, if a library or school receives E-rate funding for Internet access or internal connections, it is required to place filters on its computers with Internet access. Arguments about the appropriateness or efficacy of filtering aside, standard commercial filters have many faults in comparison to open source filtering options.

Commercial filters are often expensive, especially when deployed on a large number of computers, as would be the case in a school computer lab or in a medium or large library with many computers for patron use. In contrast, open source filters are generally freely available for download. In addition, since commercial filters are proprietary, in many cases the system administrator does not have the opportunity to modify or even view the lists of blocked sites, a.k.a. blacklists. With these filters, one must usually be content with choosing particular categories to filter, and must trust that the vendor has placed only sites within those chosen categories within the blacklists.

Also, since these filters are commercially produced, the product's target audience is frequently either parents who are using the filters on their home computers to protect their children, or businesses intent upon keeping their employees from browsing sites deemed inappropriate during working hours. Thus, these filters may have features that would be inappropriate in a library or school setting. For example, the filter that we formerly used at our library would redirect a user to a site that was deemed "kid-friendly"; this tended to confuse and often annoy our patrons.

In addition, the filter would often behave in odd ways beyond our control; sometimes instead of redirecting the user to another site, it would cause a popup window to appear with a vague network error message which tended to confuse both staff and patrons. With the open source filters we are currently using, we have full control over what happens when one of our patrons tries to access a blocked site.

Many commercial filters are also client-dependent, meaning that they must be installed on each individual computer that you need to filter. This makes changing the configuration or installing a new version of the software tiresome, because tech staff have to make the changes physically on a per-computer basis, which can be difficult to do on heavily used public computers. The alternative to this time-consuming endeavor is to use server-based filtering software, but the costs for such commercial products can quickly escalate, as most are licensed on a per-seat basis. Open source filtering programs, by contrast, run on free open source operating systems, so there are no expensive server licensing fees to pay.

With a typical server-based filtering solution running on a proprietary operating system, you would have to pay hundreds or possibly thousands of dollars for the server's operating system, on top of per-seat licensing fees for the server as well as the filtering software. With open source filtering you can be up and running with no software cost whatsoever. These server-based filters are also client-independent for the computers that are being filtered; the client computers can be set up for filtering with a simple reconfiguration of the browser(s) of your choosing.
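For example, that browser reconfiguration can be as simple as handing each browser a proxy auto-configuration (PAC) file that sends all traffic through the filtering server. This is just a sketch; the hostname and port below are placeholders for your own proxy server, not our actual settings:

```javascript
// proxy.pac -- point each browser's "automatic proxy configuration"
// setting at the URL of this file. Hostname and port are examples only.
function FindProxyForURL(url, host) {
    // Send every request through the filtering proxy.
    return "PROXY proxy.example.org:3128";
}
```

Browsers that can't use a PAC file can instead be given the same hostname and port in their manual proxy settings.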

Furthermore, the commercial products are usually black boxes, which tends to encourage complacency; the filters are installed, and staff simply hope that they filter what they should and let sites pass through that shouldn't be filtered. Frequently there are no logging features, so there is no way to tell what is or is not getting through; when logging does exist, it often goes unused because staff may be unaware of it.

Though there are a few disadvantages to using open source filters, namely the learning curve involved in using them and, in the case of URL-based filtering, the time involved in keeping the blacklists up to date, the problems described above with commercial filters do not apply to them. At the Meadville Public Library, we are using two open source filters: squidGuard (www.squidguard.org) and DansGuardian (www.dansguardian.org). Both are freely available for download at the above Web sites, and both run on most open source operating systems. Both are also server-based, making organization-wide changes to the filtering quick and simple. If you have secure shell access to the server on which the filtering software resides, you can log in and make changes from anywhere.

SquidGuard, which works through the caching and proxying program squid, is a URL-based filter. Essentially, this means that it will filter by web address or by words and/or phrases, but only those words or phrases found in the URL. This is advantageous in that it gives you fairly precise control over what is filtered, without a lot of context-insensitive filtering. However, if a website is not in your list of filtered sites, it won't be filtered. This necessitates vigilance to ensure that the filter is working properly, since thousands of new websites crop up every day. But since these filters have full logging capabilities, you can know exactly what is or is not being filtered.
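To give a flavor of how this works, a minimal squidGuard configuration defines one or more blacklist categories and an access rule that redirects blocked requests. The paths, category name, and redirect URL below are illustrative examples, not our production settings:

```
# squidGuard.conf -- illustrative sketch only
dbhome /usr/local/squidGuard/db
logdir /usr/local/squidGuard/logs

# a blacklist category, stored as plain-text domain and URL lists
dest blocked {
    domainlist blocked/domains
    urllist    blocked/urls
}

acl {
    default {
        pass !blocked all
        # send filtered requests to a page explaining the policy
        redirect http://www.example.org/blocked.html
    }
}
```

The `domains` and `urls` files can come from the downloadable blacklist, from your own custom lists, or both.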

In fact, in order to help me monitor for sites that should be blocked, I have a simple script that scans the web logs each night for a list of certain choice terms and emails me the results. That way I can easily tell whether someone is getting through to something they shouldn't be, and add it to the filter. I can also quickly look at the full Internet logs to see if there are sites accessed by that person before or after the URL in question to discover if there are any sites that either my script or squidGuard itself didn't catch. And, if that's not enough, you can scan the logs in real time for the list of terms to catch them in the act!
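A nightly scan of this kind is only a few lines of code. The sketch below is not our actual script, just a minimal illustration: it pulls the URL field out of squid's native access.log format and reports lines matching a watch list. The watch terms and sample log lines are fabricated examples; output would normally be piped to mail rather than printed.

```python
import re

# Hypothetical watch list -- substitute your own terms.
WATCH_TERMS = ["casino", "poker"]

def find_matches(log_lines, terms):
    """Return access.log lines whose URL field contains a watched term."""
    pattern = re.compile("|".join(map(re.escape, terms)), re.IGNORECASE)
    hits = []
    for line in log_lines:
        fields = line.split()
        # In squid's native log format, the request URL is the 7th field.
        if len(fields) >= 7 and pattern.search(fields[6]):
            hits.append(line.rstrip())
    return hits

# Example run against two fabricated log lines:
sample = [
    "1049198400.123  45 10.0.0.5 TCP_MISS/200 1024 GET "
    "http://www.casino-example.com/ - DIRECT/1.2.3.4 text/html",
    "1049198401.456  30 10.0.0.6 TCP_HIT/200 2048 GET "
    "http://www.example.org/ - NONE/- text/html",
]
print(len(find_matches(sample, WATCH_TERMS)))  # prints 1
```

In practice you would read the real log (e.g. /var/log/squid/access.log) and run the script from cron, mailing yourself any hits.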

In addition, you're not limited to any software manufacturer's standards of what should be in your blacklists. With squidGuard, you can download the blacklist provided at the squidGuard website, or you can create your own custom lists, or both. Also, when a site is filtered, squidGuard can be configured to redirect the user to a website or file of your choice. At our library, we redirect users to a page that states that they have attempted to visit a site in violation of our Internet Use Policy, with a link to the policy if they should wish to read it. There are no confusing error messages or unexpected redirects to random sites, so the patron knows exactly what has just happened.

Since squidGuard is URL-based, it catches fewer sites than keyword filtering, which scans the entirety of a document for chosen words or phrases. This can be a boon in that it is less likely to block sites overzealously, but in some areas you may want somewhat more stringent filtering. So, in our library's children's area we have been using another filter in addition to squidGuard, called DansGuardian. DansGuardian is a keyword filter that works in conjunction either with squidGuard or on its own with just squid. It can also filter by filename extension or MIME type, which you can use to prevent users from downloading files to your computers, or by various PICS ratings systems.
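DansGuardian sits in front of squid as its own proxy and is configured through a main dansguardian.conf plus plain-text list files. The excerpt below is illustrative only; the port numbers and URL are examples, and you should consult the sample configuration that ships with the program for the full set of options:

```
# dansguardian.conf -- illustrative excerpt, not our production settings
# DansGuardian listens here; browsers point at this port...
filterport = 8080
# ...and clean requests are passed on to squid
proxyip = 127.0.0.1
proxyport = 3128
# page shown to the user when a request is denied
accessdeniedaddress = 'http://www.example.org/blocked.html'
```

Banned phrases, filename extensions, and MIME types each live in their own list file alongside this configuration, one entry per line.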

The one major disadvantage of the version of DansGuardian that we're using (1.1.5) is that there can be only one filtering configuration per server; if you wanted more than one configuration, you would need copies of DansGuardian residing on different servers, or perhaps multiple copies running on the same server. SquidGuard, on the other hand, can have several configurations for various types of users. However, the latest version of DansGuardian (2.2) seems to be greatly improved and apparently includes this feature, as well as speed improvements and integrated URL-based filtering. But this latest version was just released on November 18th and I haven't had a chance to try it out yet.

If having support available to you is one of your major motivations for purchasing commercial software, commercial support is available for these filters. There are consulting firms that specialize in filtering with squidGuard, and commercial support for DansGuardian may be purchased from the author of the program. However, I have found the support from the programs' respective mailing lists to be more than adequate for dealing with any issues that crop up, and frequently I can get help directly from the authors themselves.

If you're willing to invest a little time in learning how to install and maintain these filters, they are an economical, reliable and effective alternative to commercial filtering products. The level of control possible with these filters is amazing. It may take a little longer to get them up and running in comparison to commercial products, especially if you're not experienced with open source software, but in the end the effort is worth it.