Annual Report on Your Data and our Server Logs
To begin, I want to emphasize this article outlines what I see and do on this website. As you travel through the Internet you most likely visit other sites and properties. I can't speak for these people. They may have entirely different programs and policies.
The Data Location
As with many activities, some of your actions on this website are recorded. In the case of visiting this website, the data is written to a server log. Specifically, it is an Apache HTTP server log using the combined format. I've used a half-dozen web hosting firms throughout the years and all of them have provided me with some sort of server logs. I never had to ask these companies to turn on the logs. The data accrued whether I looked at it or not.
In some ways these logs are similar to ones used for other activities outside of the Internet. As more processes get automated, more logs of our activities are recorded. These are things we may take for granted such as using library cards, paying bridge tolls with a Fast Pass, using supermarket membership cards or requesting an absentee ballot. One difference is I know what data is on this site's server logs and how I use it. I don't know how my supermarket uses the data from my membership card, which is why I choose not to use it.
Each month the web hosting company I use starts a new log. Every day data is appended to this file. Instead of having 30 daily files, it is one large monthly file that grows. As I write this, the monthly log file is over 70 MB, which pales in comparison to other sites. Sometime during the first week of the new month, I copy the previous monthly log to my PC and remove the web server version. I do keep the monthly log files on my PC for an unspecified amount of time for long term trend analysis.
In addition to our files, our web hosting company maintains backups. These files are used in case we need to restore content for any reason. Their backup process includes all our site files so it would include this log. A similar process is used on test servers we maintain with other web hosting companies. However, these server logs include data from internal testers and not the general public.
How Data Gets to Our Server Log
A large portion of the web traffic I see originates from search engines. People will start by typing their query into the search box. Based on their query, a listing to one of our articles shows in the search results. For this example, I'll use Google and search for the term blimperskins.
I intentionally used this term as it's not a real word. I made it up one time when I wanted to swear, but I was in the presence of a 5-year-old. The word was intentionally included in a previous article when I was testing to see if other sites might be scraping our stories. Since it's a unique word, it was easier to find in a 70 MB data file.
I'm also using Google as the search engine example for a number of reasons. The majority of our traffic comes from Google. In addition, we use two of their services -- Google AdSense and Google Analytics. AdSense is the service that provides the contextual ads for our articles for which we receive a percentage of the ad revenue. Google Analytics is one of the tools we use that interprets visitor data.
In the picture above, you can see that my search term shows in my browser's address bar. When I click the article in the results section, not only do I go to the requested page, but I'm also sending information about where I started. This is called the referrer.
A line is written to the monthly server log such as the one below. The actual log entry doesn't use word wrap and I changed the IP address.
68.145.84.73 - - [19/Feb/2006:21:22:06 -0800] "GET /mos/5_Minute_Tips/General/The_Value_of_Saying_No/ HTTP/1.1" 200 6807 "http://www.google.com/search?hl=en&q=blimperskins&btnG=Google+Search" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"
This line is just one of a half dozen or so written. The reason is that server logs record a line for each request. In order for a reader to see the article, the browser requests a style sheet, images and any other components needed to render the web page. As people move through our web site, a hundred lines could be written to the log depending on which articles they were reading.
What I Can See
I can log into my server as an administrator and use the control panel to see this event from the raw logs. The control panel makes the line easier to read by parsing the data and adding labels. I've added number labels for the items I'll discuss. If you wish to find more info on the other items, please see our earlier article on web server log information.
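If you'd like to try the same kind of labeling on your own logs, here is a minimal Python sketch. It is not what the control panel runs. The field labels are my own choice, and the pattern assumes a well-formed combined format line with no escaped quotes inside the quoted fields.

```python
import re

# Apache "combined" format:
# host ident user [time] "request" status bytes "referrer" "user-agent"
# The group names below are labels picked for this sketch.
COMBINED = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of labeled fields, or None if the line doesn't match."""
    match = COMBINED.match(line)
    return match.groupdict() if match else None

# The sample entry from this article, joined back into a single line.
sample = (
    '68.145.84.73 - - [19/Feb/2006:21:22:06 -0800] '
    '"GET /mos/5_Minute_Tips/General/The_Value_of_Saying_No/ HTTP/1.1" 200 6807 '
    '"http://www.google.com/search?hl=en&q=blimperskins&btnG=Google+Search" '
    '"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) '
    'Gecko/20060111 Firefox/1.5.0.1"'
)

fields = parse_line(sample)
for label, value in fields.items():
    print(f"{label:>9}: {value}")
```

Each labeled value corresponds to one of the numbered items discussed in the rest of this article.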
(1) IP Address 68.145.84.73
You might think of the IP address as a phone number for the public endpoint. The number references the device connecting to our web server. For most people, the IP address changes over time. Your Internet service provider (ISP) often doesn't have enough IP addresses to assign to all their customers so they allocate them on a temporary basis. As an example, the machine that is 68.145.84.73 could be mapped to another IP number 48 hours from now. In some cases, people have static or constant IP addresses based on their ISP or company. There could also be multiple computers behind one of these IP addresses.
As you might guess, there is no way I can say 68.145.84.73 is John Smith or Jane Doe. However, if I suspected this IP address hacked into my system, I might ask the Internet provider to check their logs and tell me who used it. Most likely, the ISP or company would refuse to provide that information to me without some legal authority. Moreover, the ISP would need to know the time the IP address was used on my system. Time of access is important as the IP number in question might be reassigned to another of their customers.
You might be wondering how I would know which ISP or company to call. As you can see in the raw log entry, that information isn't there. There are various services that decode IP addresses and can show who they are assigned to and a general location. These aren't secret tools and you can get an idea of what is provided by going to sites such as ShowMyIP or DNS Stuff and decoding the IP address you're currently using. If you do click these links, an entry will most likely be written to their server logs with this site as the referrer.
You may also see that these services may report a location far different from where you are based. For example, AOL users may show as being in Virginia where the company has a data center. This is also the case for people using anonymous proxy servers or VPN clients.
While I could look up all these IP addresses, it wouldn't be the best use of my time. Instead, I rely partly on Google Analytics and partly on third party packages. I'm currently evaluating a package called WebLog Expert and using their trial version. The company's website provides example reports which can give you a better idea of what else I can see.
The amusing thing about these tools is they never quite agree. I suspect some of the difference comes from the methods each company uses as well as different data sources. I've seen similar differences in other 3rd party packages. Perhaps, if I had a large e-commerce site that sold items I would be concerned about these differences.
Some people might already be concerned that I know their approximate region and wonder what value that provides me. After all, it's not like I have people in China reading my site.
Actually, using these tools I see I do have readers in China. One of the features I like about Google Analytics is it provides me a visual overview of my readers. Each dot represents the approximate location of readers. I should also mention that when I use the term reader, it doesn't necessarily mean a person. Some of the connections could be coming from search engine bots or similar automated processes.
If I place my cursor over a map dot, I get a label and the number of readers. I've drawn a red arrow on the picture pointing to the Beijing dot.
If I were to use the WebLog Expert package, I would get similar information without the map. I would see a report listing connections from 131 countries. The reports do differ as do the country names. The biggest difference is the state listing as Google reports connections from far fewer states. If you look at the map above you would think I was being boycotted in some states.
One question people might be thinking is can I tell who the 25 people are from Beijing? The short answer is no for the reason I explained earlier. Unlike some TV episodes or movies, I don't have any tool where I could click on these dots and see a list of names and addresses.
The main reason I can't say who the 25 people are is because I don't have enough information. I have bits and fragments. Although my server log is essentially a database I could mine, I would need to link it with other databases that had that information and shared a common denominator.
Those IP address fragments do provide me with some clues. For instance, I can get a breakdown by host or company which shows country information. Most times the information is generic as shown below. On some occasions the domain name may be personal or configured in such a way to give more detailed information. Last week, I spotted an entry in this list which showed a certain Washington, DC organization and Room 272.
Some members might be wondering if the server logs have more information about them since they provide a user name and password. To the best of my knowledge, the answer is no. Using my own user name and password, I searched the 70 MB file and did not see that data. I'm guessing that information is contained in the content management system's SQL database, not the server log. This example does illustrate that sites may have multiple places where information is stored.
You can now relax since I don't know who you are. That's true, but remember I'm writing this article as the webmaster for this site. Using an earlier example, what if my supermarket decided to offer Internet services? Depending on how they set the service up and their policies, they might join the information from their server logs with the information from my store purchases using the membership card data. Using the combined information they might conclude I like red wine and the Boston Red Sox.
While the location is one data variable I see, I don't find it the most important. Some sites do rely on geo-targeting. The location information may determine which language is presented or what products are offered. I look at it to remind me that I live on a very small, but interconnected planet. It also reminds me that almost 70% of the readers are based in the United States, but I need to be conscious of the other 30%.
On many occasions, I've spotted a country on the list or a dot on the map and not recognized it. This weekend I looked up the country Vanuatu. Other times, I look at the data and ask questions that I may never know the answer to such as why am I seeing a lot more traffic this week in Paris, France?
(2) What did you want? - /mos/5_Minute_Tips/General/The_Value_of_Saying_No/
The second item the log shows me is what was requested. In this example, it was the content for the story containing the word blimperskins. This information is important to me as it gives me an idea of the popularity of an article or feature. On several occasions, I have removed features from the site based on the amount of time it took me to maintain the feature versus the usage.
Like most of the information, this is an item I look at in aggregate. I'm comfortable knowing how many people requested an article rather than their location. This is not to say I wouldn't be interested to know what the people from Paris are reading this week. Conceivably, I could find this out with some other analytic tools or if I wrote a query. But, I don't want to spend the money to purchase these tools and I have other things to do with my time than writing database queries.
There are some derived data elements the analytics tools will provide which I find more valuable such as page views. This is a stat that lets me know how many pages someone requests per session. While I prefer seeing a number trending up, I also take notice when I see an abnormally high number. It can signal that something is sucking large amounts of content off my site for purposes other than I intended. One example is the screen scraping tools that grab and repost full articles on other sites for the sole purpose of getting ad revenue. Follow the blimperskins on various search engines and you'll get an idea of what I mean.
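A rough version of the page view count can be sketched in a few lines of Python. This is only an illustration, not how any analytics package works. It treats every request from one IP address as a single visit, the list of asset extensions is my own assumption, the threshold is picked by hand, and the entries are made up using reserved documentation addresses.

```python
from collections import Counter

# Assumed list of asset extensions to skip so only pages are counted.
ASSET_SUFFIXES = (".css", ".js", ".gif", ".jpg", ".png", ".ico")

def page_views_by_ip(entries):
    """Count page requests per IP from (ip, path) pairs, skipping assets."""
    counts = Counter()
    for ip, path in entries:
        if not path.lower().endswith(ASSET_SUFFIXES):
            counts[ip] += 1
    return counts

def heavy_visitors(counts, threshold):
    """Return the IPs whose page count exceeds a hand-picked threshold."""
    return sorted(ip for ip, n in counts.items() if n > threshold)

# Made-up entries: one ordinary reader and one client grabbing 50 pages.
entries = [
    ("192.0.2.10", "/mos/article-a/"),
    ("192.0.2.10", "/templates/site.css"),
    ("192.0.2.10", "/mos/article-b/"),
] + [("198.51.100.7", f"/mos/article-{n}/") for n in range(50)]

counts = page_views_by_ip(entries)
print(counts.most_common(2))
print(heavy_visitors(counts, threshold=25))
```

A high count alone doesn't prove scraping, since several readers can share one IP address. It only tells me where to look first.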
This requested data element can be used to see how you moved through the site. Since there is a line written for each page, I could follow your path from article to article. For example, I might notice people will not continue to a Page 2 of an article. I then have to figure out whether that is a content problem or system problem.
I can also spot where requests have been made for something which isn't on the server. Usually, the requests are from script kiddies looking for unpatched servers which may be vulnerable to various exploits.
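Requests for something that isn't on the server normally show up with a 404 (Not Found) status code, so they are easy to tally. The Python sketch below is one way to do it, under the assumption that the lines are in the combined format. The sample lines and the probed path are invented for illustration.

```python
import re
from collections import Counter

# Pulls the requested path and the status code out of a combined-format line.
REQUEST = re.compile(r'"(?:[A-Z]+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) ')

def missing_paths(lines):
    """Count the paths that the server answered with 404 (Not Found)."""
    counts = Counter()
    for line in lines:
        match = REQUEST.search(line)
        if match and match.group("status") == "404":
            counts[match.group("path")] += 1
    return counts

# Invented sample lines; /admin/setup.php is a hypothetical probe target.
lines = [
    '192.0.2.44 - - [19/Feb/2006:21:30:00 -0800] "GET /admin/setup.php HTTP/1.1" 404 208 "-" "-"',
    '192.0.2.44 - - [19/Feb/2006:21:30:01 -0800] "GET /admin/setup.php HTTP/1.1" 404 208 "-" "-"',
    '192.0.2.10 - - [19/Feb/2006:21:31:00 -0800] "GET /mos/article-a/ HTTP/1.1" 200 6807 "-" "-"',
]

missing = missing_paths(lines)
for path, hits in missing.most_common():
    print(hits, path)
```

The same tally also surfaces honest mistakes, such as a broken link on another site pointing to a page I moved.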
(3) Time of request - [19/Feb/2006:21:22:06 -0800]
As a single data point, this information is of little value to me unless someone is being naughty. The time reflects when you made the request to the web server. If you take all the line items associated with an IP address for a given day, you can tell how long someone was on your site. The tools can determine an elapsed time for each session.
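The elapsed time idea can be sketched as the gap between the first and last request from an IP address. This Python example is a simplification and not the method any particular tool uses. It lumps every request from one IP into a single visit, and only the first timestamp comes from the log entry shown earlier. The other two are made up.

```python
from datetime import datetime

# Timestamp layout used in the log, e.g. 19/Feb/2006:21:22:06 -0800
TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"

def visit_lengths(entries):
    """Map each IP to the time between its first and last request."""
    first_last = {}
    for ip, stamp in entries:
        when = datetime.strptime(stamp, TIME_FORMAT)
        first, last = first_last.get(ip, (when, when))
        first_last[ip] = (min(first, when), max(last, when))
    return {ip: last - first for ip, (first, last) in first_last.items()}

entries = [
    ("68.145.84.73", "19/Feb/2006:21:22:06 -0800"),  # from the sample entry
    ("68.145.84.73", "19/Feb/2006:21:22:07 -0800"),  # made up
    ("68.145.84.73", "19/Feb/2006:21:31:40 -0800"),  # made up
]

for ip, length in visit_lengths(entries).items():
    print(ip, length)
```

Keep in mind the log only knows when the last request was made. It has no idea how long someone spent reading that final page.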
I used to watch this elapsed time figure. The theory is the longer someone was on your site, the better. I also looked at when people were connected to see if there was an optimum time to take the service offline for maintenance so I would disturb the fewest users. As more people have come online from other time zones, there is not an optimum time for scheduled maintenance. The usage is pretty evenly distributed throughout the day. The variance is more on the day of the week which correlates more to when I post an article.
This is also a case where one could get consumed by statistics. It's like looking at the file properties in Microsoft Word for this document and seeing 34 revisions and 909 minutes of total editing time. Who cares? The 909 minutes is the elapsed time since I opened the document. You'll have to pardon me as I did take time off to eat, sleep and do other items and didn't close the document.
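Checking for a quiet maintenance window amounts to counting requests by hour and by day of the week. Here is a small Python sketch of that count. The timestamps after the first are invented, and the hours are in whatever time zone the server writes to the log, which is an assumption worth checking on your own host.

```python
from collections import Counter
from datetime import datetime

TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"
# datetime.weekday() numbers the days starting with Monday as 0.
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def traffic_profile(stamps):
    """Return (requests per hour, requests per weekday) as two Counters."""
    by_hour = Counter()
    by_weekday = Counter()
    for stamp in stamps:
        when = datetime.strptime(stamp, TIME_FORMAT)
        by_hour[when.hour] += 1
        by_weekday[WEEKDAYS[when.weekday()]] += 1
    return by_hour, by_weekday

stamps = [
    "19/Feb/2006:21:22:06 -0800",  # from the sample entry
    "19/Feb/2006:21:45:10 -0800",  # made up
    "20/Feb/2006:03:05:00 -0800",  # made up
]

by_hour, by_weekday = traffic_profile(stamps)
print(sorted(by_hour.items()))
print(by_weekday.most_common())
```

If the hourly counts come out nearly flat, as mine do, there is no obvious slot to pick.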
(4) Who sent you? - http://www.google.com/search?hl=en&q=blimperskins&btnG=Google+Search
The web has millions of web sites each competing for a slice of your time. My site is just one of those. It intrigues me as to how people find the site which is why I value this data element more than any other. You know how you got to this page, but I need the server logs to assist me.
The first time I found out about a referring URL was when I wrote a favorable review of a tool on Fagan Finder. An hour after releasing the edition, I had an email from Michael Fagan. The article URL had shown up in his server log. I was surprised as I knew he wasn't a subscriber so how could he have known about my article. Now, I know.
I group referring URLs into two categories. The first are websites where someone has posted a link to one of my articles or site. Just because my server log has the URL, it doesn't mean I can always access it. Sometimes the URL is restricted by password or a firewall. On too many occasions, it is restricted by my ignorance. This is my way of saying I don't understand your language yet. Of the 56 browser language variations recorded this month, I know 8. Thank heavens there are 5 variations of English or I would feel really stupid.
When I see these URLs, I try to research them as time permits. While I may know the URL, I don't know the context in which the link was referenced. This isn't an issue of my trying to trace where you have been, but more about whether I did a good job. I have made corrections or clarifications to my articles based on some of these comments.
The second category is links from search engines. This seems to be a troubling area for some people after the recent DOJ subpoenas so I wanted to include it. The simple answer is if you click on a result link from a search engine query, that query is passed to our server log.
I probably need to elaborate on that statement as some people may be hyperventilating. If so, go to the kitchen, grab a paper bag and start breathing into it. Besides, the time you take to calm yourself will help my length of visit stats provided you don't close the browser.
For those of you who are still with me, let me emphasize I only know what your search terms were. I don't know your intent. People can search for information for any number of reasons including doing the search for someone else. Below are top phrases from Google and Yahoo!
I do value this information. Although I don't know why people were querying these terms, I try to ask myself questions that would return these terms. This process gives me ideas for future articles. I'm also intrigued by the differences between the search engines and their queries.
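For the curious, pulling search phrases out of referring URLs takes very little code. The Python sketch below is one way to do it and is not how the analytics tools do it. The mapping of search engine to query parameter is my assumption: q matches the Google referrer shown earlier, and p is the parameter I understand Yahoo! search URLs to use. The Yahoo! URL is made up for the example.

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit, parse_qs

# Assumed mapping: search engine domain -> query parameter holding the terms.
QUERY_PARAMS = {"google.com": "q", "yahoo.com": "p"}

def search_phrase(referrer):
    """Return (engine, phrase) for a search referrer, or None otherwise."""
    parts = urlsplit(referrer)
    host = parts.hostname or ""
    for engine, param in QUERY_PARAMS.items():
        if host == engine or host.endswith("." + engine):
            values = parse_qs(parts.query).get(param)
            if values:
                return engine, values[0].lower()
    return None

def top_phrases(referrers):
    """Tally search phrases separately for each engine."""
    tallies = defaultdict(Counter)
    for referrer in referrers:
        found = search_phrase(referrer)
        if found:
            engine, phrase = found
            tallies[engine][phrase] += 1
    return tallies

referrers = [
    "http://www.google.com/search?hl=en&q=blimperskins&btnG=Google+Search",
    "http://search.yahoo.com/search?p=exposing+bcc",  # made-up URL
    "-",  # the log records a dash when no referrer was sent
]

for engine, tally in top_phrases(referrers).items():
    print(engine, tally.most_common(3))
```

Notice the output is only phrases and counts. Nothing in it says who typed them or why.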
If I scan the full list I see a search query exposing bcc. Now, what comes to your mind when you see that phrase? You don't have to tell me, but here are some questions that pop to my mind.
What's bcc? Does the reader mean blind carbon copy or is this some other abbreviation?
Is the reader trying to send an important email and wants to make sure recipients are protected?
Did the reader send an email with BCCs but forgot who they were?
Was the reader bcced on an email and wants to make sure no one else can see their name?
Did the reader get such an email and want to know who else may have received it?
I'm not going to go crazy and see if this query came from Room 272 or Beijing. I'll just make a mental note and if I see a lot of similar queries, I'll add the idea to my possible article list.
From my perspective, the data is neutral and the server logs are another tool. Like any tool, it can be valuable if used properly. I would like to think the decisions I've made from analyzing the data have made this site better. I'm also cognizant that if the data is misused, it erodes the confidence of users and diminishes the value of the site.
Update: Article was corrected for two errors (thanks Greg). One was a simple typo of "see" versus "seen". The second, I made a reference to "BBC" and the translation to British Broadcasting Corp. While the translation was correct I should have used an example for BCC.
Last Updated (Saturday, 19 June 2010 13:42)
