Google, search engines fat on SPAM?

I recently read an article in The Register about Google’s ongoing struggle with the massive amount of spam online. Let’s just have an honest moment here: SEO is great, but it’s exploited WAY too much by what I would consider illegitimate websites – you know, the ones whose sole purpose is to display advertisements, link portals, and the like.

Here’s an idea: let’s apply a Web 2.0 ideology to our searches, much the same way sites like Slashdot, Digg, and Newsvine let readers moderate comments. One crucial difference: we wouldn’t elevate the ranking of any given page – that task would be handled as it always has been. Rather, irrelevant sites would suffer de-listing if enough users gave them a thumbs-down based on the nature of their content.

Before you begin explaining the counter-arguments, let me save you some keystrokes: what happens when users start thumbing-down legitimate sites based on their personal viewpoints and beliefs, out of competitive interest, or for some other that’s-not-the-point-of-this-self-regulating-system reason? Maybe one way around it is to flag the page (or even the domain) for review by a select group of people. Users could first earn the title of moderator, then earn rewards for proper moderation, as reviewed by their peers (those aspiring to be moderators?). Either way, it’s a community approach to regulating the quality of search results.

…just ideas I had while reading the article. I’m tired of all the junk on the net and in my email.

Caching HTML Output with Smarty

I’ve begun using the Smarty template engine for my projects requiring dynamic content – you know, the good ol’ MVC (model-view-controller) approach to (web) applications. Because the sites I work on aren’t particularly high-traffic, I never really thought too hard about caching… until I actually began thinking about caching. The question is “why not?”.

With servers as quick as they are these days, even highly dynamic pages can be processed rather quickly. On one of my development machines, I’m getting between 65 and 70 fulfilled requests per second on a page with little in the way of optimization and no caching. By adding a simple caching scheme to this page via some built-in Smarty functions, that number jumps to about 105-110 fulfilled requests per second. Super!

Honestly, it’s so simple I might as well just point you to the appropriate documentation that tells you how to do it: HERE. The most important thing to notice is that you at least need $smarty->caching=TRUE;, and for goodness’ sake, make sure your cache directory is writable by Apache (I would also either keep your Smarty directory outside the site root or disallow access via an .htaccess file).
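If you’d like a concrete starting point, here’s a minimal sketch assuming a stock Smarty 2 setup – index.tpl and get_articles() are hypothetical stand-ins for your own template and data-fetching code:

<?php
require('Smarty.class.php');

$smarty = new Smarty;
$smarty->caching = TRUE;
$smarty->cache_lifetime = 3600; // keep cached copies for an hour

// skip the expensive database/model work when a cached copy exists
if (!$smarty->is_cached('index.tpl')) {
    $smarty->assign('articles', get_articles()); // hypothetical data fetch
}

$smarty->display('index.tpl');
?>

The is_cached() check is where the real savings come from: the page is served straight from the cache directory without touching your model code at all.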

Here’s to a 70% performance boost!

Adding a new hard drive to a Linux system

Time came to add a second hard disk to my workstation. I didn’t need a whole lot – just another 250GB for backup and extra storage space until the new workstation arrives later this summer. Here’s a quick tutorial on how to get the new disk in and running on your Linux box.

Once the hardware is properly installed, open up a terminal and log in as root.

Run fdisk on the new disk (assuming /dev/hdb is your second drive and /dev/hda is your primary):
/sbin/fdisk /dev/hdb
Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel
Building a new DOS disklabel. Changes will remain in memory only,
until you decide to write them. After that, of course, the previous
content won't be recoverable.

The number of cylinders for this disk is set to 30401.
There is nothing wrong with that, but this is larger than 1024,
and could in certain setups cause problems with:
1) software that runs at boot time (e.g., old versions of LILO)
2) booting and partitioning software from other OSs
(e.g., DOS FDISK, OS/2 FDISK)
Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)

Type m for help…

Command (m for help): m
Command action
a toggle a bootable flag
b edit bsd disklabel
c toggle the dos compatibility flag
d delete a partition
l list known partition types
m print this menu
n add a new partition
o create a new empty DOS partition table
p print the partition table
q quit without saving changes
s create a new empty Sun disklabel
t change a partition's system id
u change display/entry units
v verify the partition table
w write table to disk and exit
x extra functionality (experts only)

Type “n” for a new partition,
“p” for primary,
“1” for the partition number,
and accept the suggested defaults for the first and last cylinders (just hit Enter):
Command (m for help): n
Command action
e extended
p primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-30401, default 1):
Using default value 1
Last cylinder or +size or +sizeM or +sizeK (1-30401, default 30401):
Using default value 30401

Type “p” to print the partition table:
Command (m for help): p

Disk /dev/hdb: 250.0 GB, 250059350016 bytes
255 heads, 63 sectors/track, 30401 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Device Boot Start End Blocks Id System
/dev/hdb1 1 30401 244196001 83 Linux

Then type “w” to write the changes to disk (creating the partition on your new drive):
Command (m for help): w
The partition table has been altered!

Calling ioctl() to re-read partition table.
Syncing disks.

…you’re almost done. Just a couple more steps.

The next command creates the filesystem on the new partition:
/sbin/mkfs -t ext3 /dev/hdb1

The app will print an incrementing count as it writes the inode tables, and before you know it, it’ll be done:
mke2fs 1.38 (30-Jun-2005)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
30539776 inodes, 61049000 blocks
3052450 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=62914560
1864 block groups
32768 blocks per group, 32768 fragments per group
16384 inodes per group
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 20480000, 23887872

Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 31 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.

Final steps

Make a new directory to serve as the mount point for the new drive:
mkdir /drive2

Mount the drive:
mount -t ext3 /dev/hdb1 /drive2

Edit your fstab to auto-mount the disk at boot, adding the following line:
/dev/hdb1 /drive2 ext3 defaults 1 1
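You can sanity-check the new entry without rebooting – unmount the drive, then let mount re-mount it from fstab:

umount /drive2
mount /drive2
df -h /drive2

If df shows the filesystem mounted on /drive2, the fstab line is good.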

That’s it!

Storing hierarchical data in a database part 2b: Modified preorder tree traversal – insertions

Last time I introduced the modified preorder tree traversal algorithm as a method of storing relationships between nodes. This time I’ll show you how nodes are inserted into the tree. Note that I’m using MySQL, so you may need to adjust the queries slightly for your database.

Before we get started, I’d like to share some links I’ve found since the last post.
Wikipedia: Tree traversal
MySQL AB: Managing Hierarchical Data in MySQL

Consider our tree, introduced last time:

Say you wanted to insert “Boxer” as a child node of Dog. The left/right values for this new node will be 14/15, respectively. Before we can insert, however, we need to make some room: all left/right values greater than 13 need to be incremented by two so we can fit [14]Boxer[15] in (Dog becomes [13]Dog[16] and Cat becomes [17]Cat[18]).

$sql_lft="UPDATE animals SET lft=lft+2 WHERE lft>13";
$sql_right="UPDATE animals SET rght=rght+2 WHERE rght>13";
$sql_insert="INSERT INTO animals (`node_name`,`lft`,`rght`) VALUES ('Boxer',14,15)";

To insert a leaf node at the same level, you simply use the rght value of its left-hand neighbor (or the parent node’s lft value when the new node is the first child – you should be able to figure out why).
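To generalize those queries into something reusable, here’s a minimal sketch – insert_node() is my own hypothetical helper built on the plain mysql_* functions, not part of any library:

<?php
// $after is the anchor value: the parent's lft (to insert a first child)
// or the left neighbor's rght (to insert the next sibling)
function insert_node($node_name, $after) {
    $lft  = $after + 1;
    $rght = $after + 2;

    // make room: shift everything to the right of the insertion point
    mysql_query("UPDATE animals SET lft=lft+2 WHERE lft>$after");
    mysql_query("UPDATE animals SET rght=rght+2 WHERE rght>$after");

    // drop the new node into the gap
    $name = mysql_real_escape_string($node_name);
    mysql_query("INSERT INTO animals (`node_name`,`lft`,`rght`) VALUES ('$name',$lft,$rght)");
}

insert_node('Boxer', 13); // 13 is Dog's lft, so Boxer becomes Dog's first child
?>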

The trickiest part really isn’t the insert as much as it is writing an algorithm that determines the proper lft/rght values at every point in the hierarchy. There are lots of ways to do it, so I’ll leave it up to your imagination. The best way to understand what’s going on is by trying it yourself. If you get stuck, feel free to ask!

Next time around I’m going to discuss the idea of moving (multiple) nodes within the tree, and a few other little pieces of functionality that should serve you well…

Semantic Research: Knowledge Organization

My lab had the opportunity to meet with the folks over at SemanticResearch. Overall, a very intriguing company. But first things first: their offices are located IN the MTV Real World San Diego house, which, as some of you might know, is awesome! I don’t know how they get any work done with such a great view of the harbor.

You may be here to read about the company, but you’re probably more interested in what they (and other companies like them) do. In its simplest form: you create a network by connecting nodes (typically nouns) with certain pre-determined relationships (verbs). Combine enough nodes and relationships and you get something akin to a web. Its usefulness shows when you find links between two seemingly unrelated nodes.

Here’s a brief example, for which I’ll use the following notation:
[node1]---(relationship)---[node2]

[John]---(friend of)---[Mike]
[Mike]---(friend of)---[Anthony]
[Anthony]---(owns)---[beach house]
[beach house]---(has address)---[123 Prospect, La Jolla]
[John]---(owns)---[condo]
[condo]---(has address)---[456 Pearl., La Jolla]

So what’s the relationship between these two seemingly unrelated homes in La Jolla? OK – it’s pretty easy to see in this example, but imagine a social network with 5,000 nodes in which you’re trying to determine the relationship between two individuals.
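Just to make that concrete, here’s a minimal sketch of how a machine might find such a path – the triples mirror the notation above, and bfs_path() is my own hypothetical helper, not anything SemanticResearch ships:

<?php
$triples = array(
    array('John',        'friend of',   'Mike'),
    array('Mike',        'friend of',   'Anthony'),
    array('Anthony',     'owns',        'beach house'),
    array('beach house', 'has address', '123 Prospect, La Jolla'),
    array('John',        'owns',        'condo'),
    array('condo',       'has address', '456 Pearl., La Jolla'),
);

function bfs_path($triples, $start, $goal) {
    // build an undirected adjacency list from the triples
    $adj = array();
    foreach ($triples as $t) {
        list($a, $rel, $b) = $t;
        $adj[$a][] = array($b, $rel);
        $adj[$b][] = array($a, $rel);
    }
    // breadth-first search, remembering the path taken to each node
    $queue   = array(array($start));
    $visited = array($start => true);
    while ($queue) {
        $path = array_shift($queue);
        $node = $path[count($path) - 1];
        if ($node === $goal) return $path;
        if (!isset($adj[$node])) continue;
        foreach ($adj[$node] as $edge) {
            list($next, $rel) = $edge;
            if (isset($visited[$next])) continue;
            $visited[$next] = true;
            $queue[] = array_merge($path, array("($rel)", $next));
        }
    }
    return null;
}

$path = bfs_path($triples, '123 Prospect, La Jolla', '456 Pearl., La Jolla');
print $path ? implode('---', $path) : 'no connection found';
?>

Run against the example data, it prints the chain running through the beach house, Anthony, Mike, John, and the condo that connects the two addresses.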

We definitely look forward to working more with SemanticResearch on incorporating some of their ideas into the DFL. We’ll have 800-1000 fish, several different habitats, and several other defined relationships (predator/prey, host/parasite, etc.).

DFL site update

I’ve been making a lot of minor updates to the DFL website. I finally got around to making the site look decent in Internet Explorer – I really should have done it earlier. The only major rendering difference between the Safari/Opera/Firefox camp and IE is the position of the main page block: IE pushes it against the left edge of the screen, while S/O/F center it (the way it’s intended). Essentially, IE doesn’t take align=center on a div to mean the div itself should be centered on the page. Oh well. You win some, you lose some.
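For what it’s worth, the usual cross-browser centering recipe sidesteps align=center entirely: give the block auto side margins for the standards browsers, and set text-align on its parent for IE’s benefit. A quick sketch (the #main id and the 760px width are hypothetical):

body  { text-align: center; }   /* the part IE actually honors for centering blocks */
#main { width: 760px;           /* hypothetical fixed width */
        margin: 0 auto;         /* centers the block in Safari/Opera/Firefox */
        text-align: left; }     /* undo the inherited centering inside the block */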

Near-term plans:

  • The hierarchy algorithms I’m working on for my own personal (read: consulting job) use will be put to work creating a fish cladogram – essentially a tree diagram showing the relationships between different species of fish.
  • Carina is working on writing and rewriting content for the site to make it a bit more user-friendly.
  • We’re going to be adding habitat profiles for all the fish in the coming weeks. Among them: fresh water, pelagic and deep sea, intertidal, hard bottom, soft bottom, neritic, and continental slope.
  • Glossary of terms: Carina is collaborating with Phil Hastings and HJ Walker on putting together a glossary of the terms we’re going to be using on the site.

I’ll post updates as new features are added.

Cross-browser CSS

I must admit – it’s only been recently that I began creating sites almost exclusively with CSS rather than table layouts. I’m starting to get it, and it’s great! But there’s one problem with using CSS: different browsers render pages differently… very differently in some cases.

I choose Safari and Firefox when I’m first building a layout because I know they adhere to standards (numbers 1 and 3 among the top “winners” of the Acid2 tests, respectively). As one would expect, Internet Explorer has some rendering issues. Why do I say that? Because Firefox, Opera, and Safari generally render pages quite consistently with one another, whereas IE does not.

Here are some general tips that I’ve learned over the last few months, now that I’m becoming more of an HTML/CSS jockey again.

  • Do initial development for Safari, Opera, and Firefox (SOF).
  • Keep separate stylesheets: One for IE, and another for SOF (or other Acid2-compliant browsers)
  • Use browser detection to load the proper stylesheet(s); either server-side detection or JavaScript will work (see the script below).
  • Be patient. Don’t be frustrated if you can’t get both versions to look *exactly* alike. There’s often a level that’s “good enough.”

Browser detection script…

<script type="text/javascript">
var browser = navigator.appName;
if (browser == "Microsoft Internet Explorer")
{
    document.write('<link href="/path/to/stylesheet_ie.css" rel="stylesheet" type="text/css">');
}
else
{
    document.write('<link href="/path/to/stylesheet.css" rel="stylesheet" type="text/css">');
}
</script>
<noscript>
<link href="/path/to/stylesheet_ie.css" rel="stylesheet" type="text/css">
</noscript>

Notice that I chose the Internet Explorer stylesheet as the default when JavaScript is disabled. That’s simply because the market is dominated by IE, not Firefox, so the odds are in your favor.
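An alternative worth knowing about: IE’s conditional comments accomplish the same thing without any JavaScript. Every other browser skips them as ordinary HTML comments, while IE loads the second stylesheet – which then only needs to contain your IE-specific overrides:

<link href="/path/to/stylesheet.css" rel="stylesheet" type="text/css">
<!--[if IE]>
<link href="/path/to/stylesheet_ie.css" rel="stylesheet" type="text/css">
<![endif]-->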

Amira 3.1 and Movie Maker issues

Many thanks to Don Duncan at Mercury Computer Systems for his help getting the Movie Maker functionality back up and running on my system.

Problem: The Movie Maker module doesn’t work under Fedora FC4 with Amira 3.1.1. Part of the console error reads:
/path/to/amira/arch-Linux-Optimize/libmpegenc.so symbol errno, version GLIBC_2.0 not defined in file libc.so.6 with link time reference
Bad type "HxMovieMaker"

Solution: Don indicates other customers have had issues running Amira 3.1.1 on versions of Linux newer than Red Hat 8 (the platform Amira was built on). The workaround is to set the following environment variable:
LD_ASSUME_KERNEL=2.4
It has something to do with newer versions of the Native POSIX Thread Library (NPTL): the environment variable tells the system you want the older threading implementation.

If you’re running Fedora you probably don’t want LD_ASSUME_KERNEL=2.4 set as a default environment variable, so it’s better to set it manually when you know you’re going to need it for Amira.
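For a one-off launch, you can set it on the command line for just that session (the path below is a placeholder – point it at your actual Amira start script):

% LD_ASSUME_KERNEL=2.4 /path/to/amira/bin/start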

Installing AWStats on Fedora (FC4) with Apache virtual hosting

While the exact distribution probably doesn’t matter much, some steps are specific to Fedora (e.g., the yum install). Honestly, follow the instructions in the awstats documentation closely and you should be fine; my aim here is to call out some of the finer details.

First step, install AWStats as root:
% sudo yum install awstats

Yum will install awstats into /usr/share/awstats.

Now run the configuration script:
% cd /usr/share/awstats/tools/
% sudo perl awstats_configure.pl
.....

When it asks for the location of the server configuration file, give it your Apache conf file:
/etc/httpd/conf/httpd.conf
Follow the remaining directions until you exit the configuration script. For the sake of this tutorial I’ll call the site “mysite.”

NOTE: Apache logs typically use the common log file format. AWStats works to its fullest potential if you change the log format to combined in your httpd.conf. If you keep the common format, you’ll have to make the corresponding changes in the awstats config.
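For reference, switching formats is a one-line change to the CustomLog directive in httpd.conf – a sketch, reusing the example log path from the LogFile setting below:

CustomLog /var/path/to/my/file_access.log combined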

Now configure the awstats config file you just created with the above script:
% sudo emacs /etc/awstats/awstats.mysite.conf

You’ll need to edit a few lines to get everything working (the line numbers below are from my copy of the config and may vary by version):
(line 51) LogFile="/var/path/to/my/file_access.log"

(line 153) SiteDomain="subdomain.domain.ext"

#and any others you might be using
(line 168) HostAliases="subdomain.domain.ext 127.0.0.1 localhost"

#Optional, but improves security. Leaving this blank allows ALL; otherwise, fill in the IPs you want to allow
(line 349) AllowAccessFromWebToFollowingIPAddresses="127.0.0.1"

One thing I was unable to get working correctly with HTTP authentication was:
(line 328) AllowAccessFromWebToAuthenticatedUsersOnly=1
and
(line 339) AllowAccessFromWebToFollowingAuthenticatedUsers = "__REMOTE_USER__"

Ok!

Now run the awstats.pl update script:
% cd /usr/share/awstats/wwwroot/cgi-bin/
% sudo perl awstats.pl -update -config=mysite

If all goes well, you should see something along the lines of:
Found 0 dropped records,
Found 0 corrupted records,
Found 150 old records,
Found 50 new qualified records.

…at the end of the update script output.

To run stats for multiple domains, just repeat the steps above, making the appropriate changes for each domain…

You should now be able to visit http://www.yoursite.ext/awstats/awstats.pl?config=mysite and see the fruits of your labor. I chose not to let web users trigger automatic updates; instead, I have a cronjob that runs the awstats.pl -update script once per day (I don’t administer any high-traffic sites, so having the most up-to-date records isn’t critical). See near the end of my Incremental backups with rsync post for more on that, if you’re interested.
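For reference, the crontab entry looks something like this (the 4 a.m. run time is arbitrary):

0 4 * * * cd /usr/share/awstats/wwwroot/cgi-bin/ && perl awstats.pl -update -config=mysite > /dev/null 2>&1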

A word of caution: AWStats is often the target of worm attacks that exploit XSS (cross-site scripting) vulnerabilities. One reason to use yum to install and manage awstats is that you don’t have to do any of the work to keep it updated – just make sure automatic yum updates are enabled in your services config (menu: Desktop -> System Settings -> Server Settings -> Services; check the “yum” option and click “Start” while it’s highlighted). Also be sure to limit who can see the awstats reports.
