I’m not sure when it started, but there’s been an increase in people asking about internet privacy and data collection. People are rightfully concerned about the use of data and not just on this website. This is good but highlights that I haven’t done the best job telling people what information is captured in a web server log and how it can be used. I can’t answer for all web sites, but I can for this one. My answers may surprise you.
Many of you know I’m concerned about privacy, so I was startled to get a reader request to remove data years back. The more I dug into the issue, the more I realized people don’t know what data is collected. And as we all know, fear takes on a life of its own. I hope that this article will address those concerns.
The Data Collection Process
The process begins when your web browser requests a page from our website. Most websites maintain a server log that records these transactions because it holds useful information such as traffic patterns and errors. A log transaction is generally defined as a request for a resource such as a web page’s text, a picture, or a file. In our case, the web server uses the Apache HTTP Server combined log format. This may vary based on the hosting company.
As you move through this website, multiple lines are appended to this log in chronological order. There are multiple lines because a web page consists of many resources such as images, text, CSS style sheets, ads, etc. In other words, the web page you’re viewing may appear as one item to you, but from the web server’s view, there might have been a dozen requests to display the page. Each request becomes a line item in the webserver log. As a result, these logs can be huge. Based on the configuration, they could be produced daily or monthly.
Where’s My Personal Info?
While web server logs contain lots of data, I can’t say they are fun reading. The data can be useful, but you need a log analysis tool or a service like Splunk or Sumo Logic to make sense of the information. Different collection methods may capture different information. Some hosting companies also provide analysis programs.
In the case of the reader who wanted her data removed, she thought these logs collected all sorts of personal info. And in a sense, they do but not like many people think. There is no line item that says Jane Doe from Franklin, Tennessee came to this site from Bing and read 2 articles and left the site by clicking an Amazon book recommendation.
The best way to illustrate why I can’t tell this is to show an example log entry.
Server Log Example
Below, you’ll see one item request from the raw server log that I’ve parsed to make reading easier. I’ve also numbered the data elements. In the webserver log, this information appears as one long line. This example is about 10 years old but still works. I no longer use these logs and don’t wish to reinstall them to update the article. So, excuse me for older references and protocols.
(4) [11/May/2012:10:37:04 -0700]
(5) “GET /mos/Email/Outlook/Creating_Outlook_Signatures/ HTTP/1.1”
(10) (Windows; U; Windows NT 5.1; en-US; rv:1.7.7)
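To make the single-line layout concrete, here is a minimal sketch of splitting one combined-format line into named fields. The sample line is hypothetical: the IP address, referrer, and user agent are placeholders I made up, and only the timestamp, request, status, and size match the values discussed in this article. The field names in the pattern are also my own choice.

```python
import re

# Combined log format, in order:
# host ident user [time] "request" status size "referer" "user-agent"
COMBINED = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Hypothetical sample line (placeholder IP, referrer, and user agent).
sample = (
    '203.0.113.7 - - [11/May/2012:10:37:04 -0700] '
    '"GET /mos/Email/Outlook/Creating_Outlook_Signatures/ HTTP/1.1" '
    '200 7537 "http://www.example.com/" "ExampleBrowser/1.0"'
)

def parse_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = COMBINED.match(line)
    return match.groupdict() if match else None

entry = parse_line(sample)
print(entry["host"], entry["status"], entry["size"])
```

This simple pattern does not handle escaped quotes inside the request or user agent, so a real analysis tool would need to be more careful.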
The Log Data Elements
(1) IP address
The first data item is the IP address of the client making the request. A client could be your computer, firewall, proxy, smartphone, and so on. The IP address is dynamic for some people, meaning that it shows as 18.104.22.168 on May 11, but it might be different the next time you visit. Or, in the case of some firewalls, it could be that all the computers behind the firewall share the same IP address. Also, people who use VPNs usually appear under a different IP address than their own.
There is more information that can be inferred from an IP address, such as your location. There is a method for assigning blocks of numbers. For example, internet service providers (ISP) or large companies may be assigned blocks of IP addresses. If you want to see how your IP translates, go to Google’s Q&A page on IPs.
People should also know the geographic information isn’t always precise. Many years ago, when I was analyzing our city’s logs, a large number of entries showed Vienna, Virginia, which is hardly a neighboring community to California. At the time, AOL’s networks were set up to show all users from that location. Rest assured your ISP would not give out your address and contact information unless they received a court order requesting the data.
It’s important to note that the web logs don’t translate the IP location. The geographic translation is done by an analysis program. A simpler option is for webmasters to use something like Google Analytics. While Google may capture the IP address, it does not provide it to webmasters.
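As a rough illustration of how an analysis program might map an IP address to the block it belongs to, here is a minimal sketch using Python's standard ipaddress module. The blocks and owner names are hypothetical placeholders taken from documentation address ranges. A real tool would rely on registry or geolocation data, which is not shown here.

```python
import ipaddress

# Hypothetical allocations; real ones come from registry or geolocation data.
BLOCKS = {
    ipaddress.ip_network("203.0.113.0/24"): "Example ISP",
    ipaddress.ip_network("198.51.100.0/24"): "Example Corp",
}

def owner_of(ip_text):
    """Return the owner of the block containing ip_text, or None if unknown."""
    ip = ipaddress.ip_address(ip_text)
    for network, owner in BLOCKS.items():
        if ip in network:
            return owner
    return None

print(owner_of("203.0.113.7"))
print(owner_of("192.0.2.1"))
```

Note that the lookup identifies the organization holding the block, not the person using the address.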
(2) Identity Check
At first, I thought the displayed hyphen was a delimiter, but it actually means data is not available. The field is used for determining the identity of the client machine. The name was a little worrisome until I read the Apache documentation that states, “This information is highly unreliable and should almost never be used except on tightly controlled internal networks. Apache httpd will not even attempt to determine this information unless IdentityCheck is set to On.”
Item 3 is the user ID of the person requesting the page, as determined by HTTP authentication. Again, the field shows as a hyphen since no data was collected. This field might show data if the article being requested was password protected and I required authentication. I do use this field internally to access test areas.
(4) When did the server finish the request
This is the time the web server finished processing your request. The -0700 indicates our web server’s clock is 7 hours behind GMT.
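For readers who want to see how the timestamp and its offset can be read by a program, here is a minimal sketch using Python's standard datetime module. The format string is my own reading of the example timestamp, and it assumes English month abbreviations.

```python
from datetime import datetime, timedelta, timezone

stamp = "11/May/2012:10:37:04 -0700"

# %z reads the "-0700" offset, so the result is a timezone-aware datetime.
when = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")

offset_hours = when.utcoffset().total_seconds() / 3600
print(offset_hours)
print(when.astimezone(timezone.utc))
```

Converting every entry to UTC this way makes it easier to compare logs from servers in different time zones.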
(5) What can I get you?
This line indicates what you requested. In this instance, the reader requested the article on creating Outlook signatures. The HTTP/1.1 indicates what protocol was used. A protocol is a format two devices use to exchange information.
(6) Result Code
This number is the status code the server sent back to your browser. If everything worked, you get what you requested. Otherwise, you might see one of our infamous “Oops…we’re sorry” pages (aka 404 errors). In this case, the 200 indicates the server fulfilled the request successfully.
Item 7 indicates the size of the object returned to the client. In this case, it was the size of the article, or 7537 bytes.
(8) Who sent you?
One advantage of the combined log format is that it shows who referred you to our site. Don’t worry, as the “who” is never a person. In the example above, the reader searched the US version of Google for “Outlook signature.” This information is passed along in the URL from search engines or links from other websites.
We should mention that the search engines stopped showing what the reader searched for in the logs many years ago.
(9-12) Browser Information
Items 9-12 are sent by your browser and show which version you’re using and your operating system. In the example above, the client was using the US version of Windows NT 5.1 with version 1.0.3 of Firefox.
What Do You Do With This Data?
The next question is whether I use all this data. The short answer is that I use some of the data, but not all. While web server logs collect a lot of information, that doesn’t mean it’s correct or meaningful. I’m primarily concerned with trends and what items I might need to change. And as Google Analytics has gotten better, I rely less on these logs.
Are there any pages that are broken that I need to fix?
I can tell whether there is a problem by looking at items 5 and 6. This is an important issue since a broken or slow web page is a terrible user experience.
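As an illustration of combining items 5 and 6, here is a minimal sketch that counts which paths returned a 404. It assumes each log line has already been split into a dict, and the keys "request" and "status" and the sample entries are hypothetical.

```python
from collections import Counter

# Hypothetical, already-parsed entries (item 5 = request, item 6 = status).
entries = [
    {"request": "GET /index.html HTTP/1.1", "status": "200"},
    {"request": "GET /old-page.html HTTP/1.1", "status": "404"},
    {"request": "GET /old-page.html HTTP/1.1", "status": "404"},
    {"request": "GET /missing.png HTTP/1.1", "status": "404"},
]

def broken_paths(entries):
    """Count 404 responses per requested path."""
    counts = Counter()
    for entry in entries:
        if entry["status"] == "404":
            parts = entry["request"].split()
            if len(parts) == 3:  # method, path, protocol
                counts[parts[1]] += 1
    return counts

print(broken_paths(entries).most_common())
```

Paths near the top of the list are the ones most worth fixing or redirecting first.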
What Are People Reading?
OK, no one should ever be shocked that a webmaster wants to know this information. After all, if you’re not reading their content, they don’t have a business. It only makes sense that web admins want to know the most read articles and the least read articles.
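Ranking articles can be sketched along the same lines. This example assumes the log has already been reduced to (path, status) pairs, and the list of asset suffixes to skip is my own guess at what counts as a supporting file rather than an article. The paths are made up.

```python
from collections import Counter

# Assumed suffixes for supporting files that are not articles.
ASSET_SUFFIXES = (".css", ".js", ".png", ".jpg", ".gif", ".ico")

# Hypothetical (path, status) pairs taken from parsed log lines.
hits = [
    ("/articles/a/", "200"),
    ("/articles/a/", "200"),
    ("/articles/a/", "200"),
    ("/articles/b/", "200"),
    ("/style.css", "200"),
    ("/articles/c/", "404"),
]

def page_views(hits):
    """Count successful requests per page, ignoring assets and errors."""
    views = Counter()
    for path, status in hits:
        if status == "200" and not path.lower().endswith(ASSET_SUFFIXES):
            views[path] += 1
    return views

views = page_views(hits)
ranked = views.most_common()
print("most read:", ranked[0])
print("least read:", ranked[-1])
```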
Hey, are you new to these parts?
As with any business, you like to get new customers and keep the regulars. This is the type of information you can get after accumulating enough daily log files. Even then, the info isn’t precise because so many people have dynamic IPs or come in using a different device. One way I could avoid this problem is to force people to register, but I don’t.
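Here is a minimal sketch of the new-versus-returning comparison, assuming visits have been reduced to (date, IP address) pairs with ISO-style dates. The data is hypothetical, and for the reasons given above, treating an IP address as a visitor is only a rough approximation.

```python
# Hypothetical (date, ip) pairs collected from several daily log files.
visits = [
    ("2012-05-10", "203.0.113.7"),
    ("2012-05-11", "203.0.113.7"),
    ("2012-05-11", "198.51.100.4"),
]

def new_and_returning(visits, day):
    """Split the IPs seen on `day` into first-time and previously seen sets."""
    # ISO dates compare correctly as plain strings.
    seen_before = {ip for date, ip in visits if date < day}
    today = {ip for date, ip in visits if date == day}
    return today - seen_before, today & seen_before

new, returning = new_and_returning(visits, "2012-05-11")
print("new:", new)
print("returning:", returning)
```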
How did you find us?
As you might expect, item 8 can help us in this regard. I look at the referrer information as it indicates where someone posted information about our site or articles. This gives us an opportunity to read what was said on another website and post our comments if needed.
The biggest concern people usually have is seeing their search terms included in a log entry. I can understand this, as I never knew this happened until I looked at a web server log. The search terms are useful as they give me an idea of what type of information people need. These keywords have also helped us with language differences, where as a US-based author I might use one term, but someone from Europe might use a term or phrase I might not know. Yes, I’m still trying to figure out what the Brits mean by a “punter”.
Update: Search engines no longer pass along the keywords.
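To show how the referrer field was read, here is a minimal sketch that pulls the referring site and, where present, a search term out of a referrer URL. The URLs are hypothetical, and the assumption that the search term travels in a query parameter named q is mine. As the update notes, current search engines generally leave it out, so the second value will usually be empty today.

```python
from urllib.parse import urlparse, parse_qs

def referrer_summary(referer):
    """Return (site, search_terms); either may be None when not available."""
    if not referer or referer == "-":
        return None, None
    parts = urlparse(referer)
    terms = parse_qs(parts.query).get("q")  # assumed parameter name
    return parts.netloc, (terms[0] if terms else None)

print(referrer_summary("http://search.example.com/search?q=outlook+signature"))
print(referrer_summary("http://blog.example.org/post/123"))
print(referrer_summary("-"))
```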
What browsers are people using?
We use item 12 to answer this question. The reason we’re interested is that different browsers handle web code in different ways. While the differences may be subtle, there are times when I have abandoned some features because I couldn’t get them to work correctly with a specific browser.
I suppose if I had ample time and budget I would be more proactive with this information. For example, I might offer a reminder to people using older browsers to upgrade as they may be at risk.
The other reason I look at this info is there are certain bots designed to harvest email addresses or images from websites. Since I don’t have forums, I don’t have to worry about this too much. I still block these agents when appropriate.
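A rough tally of browsers and bots can be sketched with simple substring checks. The token lists and sample user agent strings below are illustrative assumptions, not a list any site actually uses. User agent strings are easy to fake, so treat the result as a hint rather than a fact.

```python
from collections import Counter

# Checked in order: Chrome user agents also contain "Safari".
BROWSER_TOKENS = ["Firefox", "Chrome", "Safari"]
# Assumed hints that an agent is automated.
BOT_HINTS = ("bot", "crawler", "spider", "harvest")

def classify(agent):
    """Return 'bot', a browser name, or 'other' for a user agent string."""
    lowered = agent.lower()
    if any(hint in lowered for hint in BOT_HINTS):
        return "bot"
    for token in BROWSER_TOKENS:
        if token in agent:
            return token
    return "other"

# Illustrative user agent strings.
agents = [
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.7) Firefox/1.0.3",
    "Mozilla/5.0 (X11; Linux x86_64) Chrome/100.0 Safari/537.36",
    "ExampleHarvestBot/2.0",
]

tally = Counter(classify(agent) for agent in agents)
print(tally)
```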
You downloaded how much data?
Many people have the notion that the web is free. Well, this is true if you don’t have a website. The truth is that websites have data costs in terms of storage or bandwidth transmission. This factors into my hosting agreement.
In most cases, bandwidth isn’t an issue. I’m more than happy to offer content to people. After all, this website intends to help people. However, I draw the line when it becomes clear people are scraping huge chunks of this site for their economic gain.
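Spotting heavy downloaders comes down to adding up item 7 per client. Here is a minimal sketch, assuming the log has been reduced to (IP address, size) pairs. The addresses and sizes are hypothetical, apart from the 7537-byte article mentioned earlier.

```python
from collections import defaultdict

# Hypothetical (ip, size) pairs; size is "-" when no body was sent.
rows = [
    ("203.0.113.7", "7537"),
    ("203.0.113.7", "-"),
    ("198.51.100.4", "120000"),
    ("198.51.100.4", "380000"),
]

def bytes_by_client(rows):
    """Sum the bytes returned to each client IP."""
    totals = defaultdict(int)
    for host, size in rows:
        if size != "-":
            totals[host] += int(size)
    return dict(totals)

totals = bytes_by_client(rows)
heaviest = max(totals, key=totals.get)
print(heaviest, totals[heaviest])
```

A large total on its own does not prove scraping, but it shows where to look more closely.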
I hope the above information answered some questions about internet privacy and web server logs. I can answer questions about this site, but I can’t speak for other sites. The brilliance of the web is how it is interconnected, but it comes with risks. The downside is that some sites do install spyware or combine server log information with other databases, which can reveal more information about you than you might be aware of. The best defense is to be vigilant about spyware and always read End User License Agreements (EULA) and Privacy Policies.