I’m not sure when it started, but there’s been an increase in people asking about internet privacy and data collection. People are concerned about the use of data in general and not just on this website. This is good, but it also highlights that I haven’t done the best job in telling people what information is captured in a web server log and how it can be used. I can’t answer for all web sites, but I can for this one. My answers may surprise you.
Many of you know I’m concerned about privacy so I was startled to get a reader request to remove data. The more I dug into the issue it became clear that people don’t know what data is collected. And as we all know, fear takes on a life of its own. I hope that this article will address those concerns.
How the Data Collection Process Starts
The process begins when your web browser requests a page from our website. Most sites maintain a web server log that records these transactions, because it holds useful information such as traffic patterns and errors. A log transaction is generally defined as getting a resource such as a web page, picture, file and so on. The server log can be configured to capture various fields. In our case, the web server uses the Apache HTTP combined server log format. This may vary based on the hosting company.
As you move through this website, multiple lines are appended to this log in chronological order. The reason there are multiple lines is that a web page consists of many resources such as images, text, style sheets, ads and so on. In other words, the web page you’re viewing may appear as one item to you, but from the web server’s view, there might have been a dozen requests to display the page. Each request becomes its own line item in the web server log, which is why these logs are very large.
Where’s My Personal Info?
While web logs contain lots of data, I can’t say they are fun reading. The data can be useful, but you need a log analysis tool or a service like Google Analytics to make sense of the information. Different collection methods may capture different information. For example, my raw server logs show a user’s IP address, but Google Analytics wouldn’t show that data element. Below, you’ll see one item request from the raw server log that I’ve parsed to make reading easier. I’ve also numbered the data elements. In the web server log, this information appears as one long line.
(4) [11/May/2012:10:37:04 -0700]
(5) “GET /mos/Email/Outlook/Creating_Outlook_Signatures/ HTTP/1.1”
(10) (Windows; U; Windows NT 5.1; en-US; rv:1.7.7)
(1) IP address
The first data item is the IP address of the client making the request. A client could be your computer, firewall, proxy, smart phone and so on. For some people, the IP address is dynamic, meaning that it shows as 126.96.36.199 on May 11, but it might be different the next time you visit. Or, in the case of some firewalls, all the computers behind the firewall may use the same IP address.
More information about your location can be inferred from an IP address because the numbers are assigned methodically. For example, internet service providers (ISPs) or large companies may be assigned blocks of IP addresses. If you want to see how your IP translates, go to Google’s Q&A page on IPs.
People should also know the geographic information isn’t always precise. Several years ago, a common example was Vienna, Virginia, where many AOL users appeared to be located because of the way AOL’s networks were set up. Rest assured your ISP would not give your address and contact information unless they received a court order requesting the data.
In most cases, webmasters have a country and domain name lookup feature in their analysis program. The information is aggregated and I can see how many users are from a specific country or domain.
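The idea of assigned address blocks can be illustrated with a short Python sketch. The block below is hypothetical: it uses a range reserved for documentation rather than any real ISP allocation, and the function name is my own.

```python
import ipaddress

# Hypothetical allocation: 198.51.100.0/24 is reserved for documentation and
# stands in here for a block assigned to an ISP or a large company.
ISP_BLOCK = ipaddress.ip_network("198.51.100.0/24")

def in_block(address, block=ISP_BLOCK):
    """Return True if the address falls inside the assigned block."""
    return ipaddress.ip_address(address) in block

print(in_block("198.51.100.42"))  # True
print(in_block("203.0.113.7"))    # False
```

A lookup service does something similar on a much larger scale, matching an address against many known blocks to guess at a country or provider.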
(2) Identity Check
At first, I thought the displayed hyphen was a delimiter, but it actually means data is not available. The field is used for determining the identity of the client machine. The name was a little worrisome until I read the Apache documentation that states, “This information is highly unreliable and should almost never be used except on tightly controlled internal networks. Apache httpd will not even attempt to determine this information unless IdentityCheck is set to On.”
Again, the field shows as a hyphen since no data was collected. This field might show data if the article being requested was password protected and I required authentication. I do use this field for internal use to access test areas.
(4) When did the server finish the request
This is the time the web server finished getting your information. The -0700 indicates our web server’s clock is seven hours behind GMT.
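For anyone who wants to work with this field, here is a minimal Python sketch that reads the timestamp from the sample entry and converts it to GMT. The variable names are my own, and it assumes the brackets have already been stripped.

```python
from datetime import datetime, timezone

# Timestamp from the sample entry, without the surrounding brackets.
raw = "11/May/2012:10:37:04 -0700"

# %z reads the "-0700" offset, so the result carries its time zone.
local_time = datetime.strptime(raw, "%d/%b/%Y:%H:%M:%S %z")
utc_time = local_time.astimezone(timezone.utc)

print(local_time.isoformat())  # 2012-05-11T10:37:04-07:00
print(utc_time.isoformat())    # 2012-05-11T17:37:04+00:00
```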
(5) What can I get you?
This line indicates what you requested. In this instance, the reader requested the article on creating Outlook signatures. The HTTP/1.1 indicates what protocol was used. A protocol is a format two devices use to exchange information.
(6) Result Code
This number indicates the status code the server sent back to your browser. If everything worked, you get your request. Otherwise, you might see one of our infamous “Oops…we’re sorry” pages (aka 404 errors). In this case, the 200 indicates the server fulfilled the request successfully.
This figure indicates the size of the object returned. In this case, it was the size of the article or 7537 bytes.
(8) Who sent you?
One advantage to the combined log format is it shows who referred you to our site. Don’t worry as the who is never a person. In the example above, the reader did a search on the US version of Google for “Outlook signature”. This information is passed along in the URL from search engines or links from other websites.
We should mention that this referrer information isn’t based on any marketing or partnering agreements with search engines or sites. If this type of information concerns you, there are software programs that will strip this information.
(9-12) Browser Information
Items 9-12 are sent by your browser and show which version you’re using and your operating system. In the example above, the client was using the US version of Windows NT 5.1 with version 1.0.3 of Firefox.
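To show how the fields described above sit in one long line, here is a rough Python sketch that splits a combined-format entry with a regular expression. The sample line is a reconstruction for illustration only: the IP is a documentation address, and the referrer and full browser string are my assumptions based on the description above, not a copy of the real log. The pattern treats the browser details as a single field rather than items 9-12.

```python
import re

# Hypothetical log line assembled for illustration (not the real entry).
SAMPLE = (
    '203.0.113.7 - - [11/May/2012:10:37:04 -0700] '
    '"GET /mos/Email/Outlook/Creating_Outlook_Signatures/ HTTP/1.1" 200 7537 '
    '"http://www.google.com/search?q=Outlook+signature" '
    '"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.7) '
    'Gecko/20050414 Firefox/1.0.3"'
)

# One named group per field of the Apache combined log format.
COMBINED = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = COMBINED.match(line)
    return match.groupdict() if match else None

fields = parse_line(SAMPLE)
for name, value in fields.items():
    print(f"{name:9} {value}")
```

Real logs can contain escaped quotes and other oddities, so a dedicated log analysis tool is the safer choice for anything beyond a quick look.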
What Are You Doing With My Data?
The next question is whether I use all this data. The short answer is that I use some of the data, but not all. While web server logs collect a lot of information, that doesn’t mean it’s correct or meaningful. I’m primarily concerned with trends and what items I might need to change. And as Google Analytics has gotten better, I rely less on these logs.
The other important item is that to leverage the log information, webmasters need another application that can parse, sort, filter, aggregate and do lookups. I do use a third-party package to help answer the questions below. If you’re interested in log analysis, you might check out Splunk.
Are there any pages that are broken that I need to fix?
I can tell there is a problem by looking at items 5 and 6. This is an important issue since a broken or slow web page is a terrible user experience.
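As a rough illustration of pairing the request with its result code, the Python sketch below counts 404 responses per page. The paths and the function name are made up, and it assumes the log has already been reduced to (path, status) pairs.

```python
from collections import Counter

# Hypothetical (path, status) pairs, as they might come out of a log parser.
requests = [
    ("/mos/Email/Outlook/Creating_Outlook_Signatures/", 200),
    ("/old-page/", 404),
    ("/old-page/", 404),
    ("/missing.png", 404),
]

def broken_pages(pairs):
    """Count 404 responses per requested path, most frequent first."""
    counts = Counter(path for path, status in pairs if status == 404)
    return counts.most_common()

print(broken_pages(requests))  # [('/old-page/', 2), ('/missing.png', 1)]
```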
What Are People Reading?
OK, no one should ever be shocked that a webmaster wants to know this information. After all, if you’re not reading their content, they don’t have a business. It only makes sense that webmasters want to know which articles are read the most and which are read the least.
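Because every image and style sheet gets its own log line, a readership tally needs to skip those supporting files. Here is a small Python sketch of that idea. The extension list and the sample paths are assumptions for illustration.

```python
from collections import Counter

# File extensions treated as supporting resources rather than articles
# (an assumption; adjust for the site in question).
ASSET_EXTENSIONS = (".png", ".jpg", ".gif", ".css", ".js", ".ico")

def article_counts(paths):
    """Tally requests per path, skipping images, style sheets and scripts."""
    pages = (p for p in paths if not p.lower().endswith(ASSET_EXTENSIONS))
    return Counter(pages)

# Hypothetical request paths pulled from a log.
paths = [
    "/mos/Email/Outlook/Creating_Outlook_Signatures/",
    "/css/site.css",
    "/images/logo.png",
    "/mos/Email/Outlook/Creating_Outlook_Signatures/",
    "/about/",
]

ranked = article_counts(paths).most_common()
print("Most read: ", ranked[0])
print("Least read:", ranked[-1])
```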
Hey, are you new to these parts?
As with any business, you like to get new customers and keep the regulars. This is the type of information you can get after accumulating enough daily log files. Even then, the info isn’t precise because so many people have dynamic IPs in which case they appear new to the web server. One way I could avoid this problem is to force people to register, but I don’t.
How did you find us?
As you might expect, item 8 can help us in this regard. I look at the referrer information as it indicates where someone posted information about our site or articles. This gives us an opportunity to read what was said on another website and post our comments if needed.
The biggest concern people usually have is seeing their search terms included in a log entry. I can understand this, as I never knew this happened until I looked at a web server log. The search terms are useful as they give me an idea of what type of information people need. These keywords have also helped us with language differences where, as a US-based author, I might use one term, but someone from Europe might use a term or phrase I might not know. Yes, I’m still trying to figure out what the Brits mean by a “punter”. (Update: Many search engines no longer pass along the keyword.)
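For the curious, here is a minimal Python sketch of how a search phrase can be read out of a referrer URL. The URL is a made-up example, and the sketch assumes the search engine puts the phrase in a "q" parameter, which not every engine does.

```python
from urllib.parse import urlsplit, parse_qs

def search_terms(referrer):
    """Return the search phrase from a referrer URL, or None if absent."""
    query = parse_qs(urlsplit(referrer).query)
    terms = query.get("q")  # assumes the engine uses a "q" parameter
    return terms[0] if terms else None

# Hypothetical referrer values.
print(search_terms("http://www.google.com/search?q=Outlook+signature"))  # Outlook signature
print(search_terms("https://www.google.com/"))                           # None
```

The second call shows what the update above describes: when the keyword is not passed along, there is nothing to extract.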
What browsers are people using?
We use item 12 to answer this question. The reason we’re interested is that different browsers handle the web code in different ways. While the differences may be subtle, there are times where I have abandoned some features because I couldn’t get them to work correctly with a specific browser.
I suppose if I had ample time and budget I would be more proactive with this information. For example, I might offer a reminder to people using older browsers to upgrade as they may be at risk.
The other reason I look at this info is there are certain bots designed to harvest email addresses or images from websites. Since I don’t have forums, I don’t have to worry about this too much. I still block these agents when appropriate.
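A simple way to spot such agents is to look for known substrings in the browser field. The Python sketch below shows the idea. The marker names are invented for illustration and do not refer to real bots.

```python
# Hypothetical substrings; real harvesters go by many different names.
BLOCKED_MARKERS = ("exampleharvester", "imagegrabber")

def is_blocked_agent(user_agent):
    """Return True if the user-agent string contains a blocked marker."""
    ua = user_agent.lower()
    return any(marker in ua for marker in BLOCKED_MARKERS)

print(is_blocked_agent("ExampleHarvester/2.0 (+http://example.com/bot)"))  # True
print(is_blocked_agent("Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.7)"))  # False
```

Keep in mind that the browser field is supplied by the client, so a determined bot can simply claim to be an ordinary browser.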
You downloaded how much data?
Many people have the notion that the web is free. Well, this is true if you don’t have a website. The truth is that websites have data costs in terms of storage or bandwidth transmission. This is typically set by a contract with a hosting service. If you exceed a contractual term, you pay.
In most cases, bandwidth isn’t an issue. I’m more than happy to offer content to people when asked. After all, the intent of this website is to help people. However, I do draw the line when it becomes clear people are copying huge chunks of this site for their economic gain.
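Spotting heavy downloaders comes down to adding up the size field for each client. Here is a short Python sketch under the assumption that the log has been reduced to (IP, size) pairs. The addresses are documentation examples and the sizes, apart from 7537, are invented.

```python
from collections import defaultdict

def bytes_per_client(entries):
    """Sum response sizes per IP address; a size of "-" means no body was sent."""
    totals = defaultdict(int)
    for ip, size in entries:
        if size != "-":
            totals[ip] += int(size)
    return dict(totals)

# Hypothetical (ip, size) pairs taken from parsed log lines.
entries = [
    ("203.0.113.7", "7537"),
    ("203.0.113.7", "-"),
    ("198.51.100.42", "1200"),
    ("203.0.113.7", "463"),
]

totals = bytes_per_client(entries)
print(totals)  # {'203.0.113.7': 8000, '198.51.100.42': 1200}
```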
Browser Cookies Anyone?
I suspect the above information answered some questions about internet privacy and web server logs. Certainly, I can answer questions about this site, but I can’t speak for other sites. The brilliance of the web is how it is interconnected, but it comes with risks. The downside is that some sites do install spyware or combine server log information with other databases, which can reveal more information about you than you might be aware of. The best defense is to be vigilant about spyware and always read End User License Agreements (EULAs) and Privacy Policies.