Many of you know I'm concerned about privacy, so I was startled to get a reader request to remove data. The more I dug into the issue, the more apparent it became that people don't know what data is collected. And as we all know, fear takes on a life of its own. I hope that this article will address those concerns.
How the Data Collection Process Starts
The process begins when your web browser requests a page from our website. Most sites maintain a web server log that records these transactions because it contains very useful information, such as traffic patterns and errors. A log transaction is generally defined as a request for a resource such as a web page, picture or file. The server log can be configured to capture various fields. In our case, the web server uses the Apache HTTP Server combined log format.
As you move through this website, multiple lines are appended to this log in chronological order. The reason there are multiple lines is that a web page consists of many resources such as images, text, style sheets and so on. In other words, the web page you're viewing may appear as one item to you, but from the web server's view there might have been a dozen requests to display the page. Each request becomes its own line item in the web server log, which is why these logs are very large.
Where's My Personal Info?
While web logs contain lots of data, I can't say they are fun reading. The data can be useful, but you need a log analysis tool or a service like Google Analytics to make sense of the information. Different collection methods may capture different information. As an example, my raw server logs show a user's IP address, but Google Analytics wouldn't show that data element. Below, you'll see one item request from the raw server log that I've split up to make reading easier. I've also numbered the data elements. In the web server log, this information appears as one long line.
(4) [11/May/2005:10:37:04 -0700]
(5) "GET /mos/Email/Outlook/Creating_Outlook_Signatures/ HTTP/1.1"
(8) http://www.google.com/search?q=outlook+signature+&hl=en&lr=&start=20&sa=N
(9) Mozilla/5.0
(10) (Windows; U; Windows NT 5.1; en-US; rv:1.7.7)
(1) IP address
The first data item is the IP address of the client making the request. A client could be your computer, firewall, proxy, smartphone and so on. For many people, the IP address is dynamic, meaning that it shows as 126.96.36.199 on May 11, but it might be different the next time you visit. Or, in the case of some firewalls, all the computers behind the firewall may share the same IP address.
Additional information about your location can be inferred from an IP address because there is a methodology for assigning the numbers. For example, internet service providers (ISPs) or large companies may be assigned blocks of IP addresses. If you want to see how your IP translates, go to http://www.showmyip.com/ . This site provides some interesting information about your IP address.
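The idea of addresses being handed out in blocks can be sketched in a few lines of Python using the standard ipaddress module. This is only an illustration: the two blocks and their owner labels below are made up, and both ranges are reserved for documentation rather than assigned to any real ISP or company.

```python
import ipaddress

# Hypothetical allocation table (block -> owner label), for illustration only.
# Both ranges are reserved for documentation and belong to no real ISP.
BLOCKS = {
    ipaddress.ip_network("203.0.113.0/24"): "Example ISP",
    ipaddress.ip_network("198.51.100.0/24"): "Example Corp",
}

def owner_of(ip_text):
    """Return the label of the block that contains ip_text, or None."""
    ip = ipaddress.ip_address(ip_text)
    for block, owner in BLOCKS.items():
        if ip in block:
            return owner
    return None

print(owner_of("203.0.113.42"))
```

Real lookup services work from much larger, regularly updated tables, which is one reason their answers are approximate.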
People should also know the geographic information isn't always precise. A common example is the number of people who show as being near Vienna, Virginia because they're AOL users. Rest assured your ISP would not provide your address and contact information unless they received a court order requesting the data.
In most cases, webmasters have a country and domain name lookup feature in their analysis program. The information is aggregated and I can see how many users are from a specific country or domain.
(2) Identity Check
At first, I thought the displayed hyphen was a delimiter, but it actually means data is not available. The field is used for determining the identity of the client machine. The name was a little worrisome until I read the Apache documentation that states, "This information is highly unreliable and should almost never be used except on tightly controlled internal networks. Apache httpd will not even attempt to determine this information unless IdentityCheck is set to On."
The third item, the user ID, also shows as a hyphen since no data was collected. This field might show data if the article being requested was password protected and I required authentication. I do use this field internally to access test areas.
(4) When did the server finish the request
This is the time the web server finished getting your information. The -0700 indicates our web server's clock is seven hours behind GMT.
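For the curious, the timestamp can be read with Python's standard datetime module. This is a minimal sketch; the function name parse_log_time is mine, and it assumes an English locale so that the month abbreviation "May" is recognized.

```python
from datetime import datetime, timezone

def parse_log_time(field):
    """Parse a timestamp such as [11/May/2005:10:37:04 -0700].

    Assumes an English locale for the month abbreviation (%b).
    """
    return datetime.strptime(field.strip("[]"), "%d/%b/%Y:%H:%M:%S %z")

stamp = parse_log_time("[11/May/2005:10:37:04 -0700]")
offset_hours = stamp.utcoffset().total_seconds() / 3600  # hours relative to GMT
as_gmt = stamp.astimezone(timezone.utc)                   # same instant in GMT
print(offset_hours, as_gmt.isoformat())
```

Converting everything to one time zone like this makes it easier to compare entries from servers in different places.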
(5) What can I get you?
This line indicates what you requested. In this instance, the reader requested the article on creating Outlook signatures. The HTTP/1.1 indicates what protocol was used. A protocol is a format two devices use to exchange information.
(6) Result Code
This number indicates the status code the server sent back to your browser. If everything worked, you get your request. Otherwise, you might see one of our infamous "Oops...we're sorry" pages (aka 404 errors). In this case, the 200 indicates the request succeeded and the server sent the page back to your web browser.
The seventh item indicates the size of the object returned. In this case it was the size of the article, or 7537 bytes.
(8) Who sent you?
One advantage of the combined log format is that it shows who referred you to our site. Don't worry, as the "who" is never a person. In the example above, the reader did a search on the US version of Google for "Outlook signature". This information is passed along in the URL from search engines or links from other websites.
We should mention that this referrer information isn't based on any marketing or partnering agreements with search engines or sites. If this type of information concerns you, there are software programs that will strip this information.
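To show how little is actually in a referrer, here is a small Python sketch that pulls the site name and search phrase out of the example URL. The function name search_terms is mine, and it assumes the search phrase travels in a parameter named q, as it does in the Google URL above; other sites may use a different name.

```python
from urllib.parse import urlsplit, parse_qs

def search_terms(referrer):
    """Return (host, search phrase) for a referrer URL.

    Assumes the phrase is in a query parameter named "q";
    the phrase is None when no such parameter is present.
    """
    parts = urlsplit(referrer)
    values = parse_qs(parts.query).get("q")
    phrase = values[0].strip() if values else None
    return (parts.hostname, phrase)

REFERRER = "http://www.google.com/search?q=outlook+signature+&hl=en&lr=&start=20&sa=N"
print(search_terms(REFERRER))
```

Note that the result is a site and a phrase, not a name or an email address.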
(9-12) Browser Information
Items 9-12 are sent by your browser and indicate which version you're using and your operating system. In the example above, the client was using Windows NT 5.1 with the US English version 1.0.3 of Firefox.
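Putting the pieces together, a line in the combined format can be split into named fields with a regular expression. The Python below is a rough sketch, not how any particular analysis package does it. The sample line is rebuilt from the fields discussed above, with 192.0.2.1 standing in as a placeholder address reserved for documentation, and the field names are my own. The browser details that this article numbers as items 9-12 arrive as a single quoted string, so the sketch keeps them together as "agent".

```python
import re

# Combined log format:
# host ident user [time] "request" status size "referrer" "user agent"
COMBINED = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of named fields, or None if the line doesn't match."""
    match = COMBINED.match(line)
    return match.groupdict() if match else None

# Sample rebuilt from the article's fields; the IP is a placeholder.
SAMPLE = (
    '192.0.2.1 - - [11/May/2005:10:37:04 -0700] '
    '"GET /mos/Email/Outlook/Creating_Outlook_Signatures/ HTTP/1.1" 200 7537 '
    '"http://www.google.com/search?q=outlook+signature+&hl=en&lr=&start=20&sa=N" '
    '"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.7)"'
)

entry = parse_line(SAMPLE)
method, path, protocol = entry["request"].split()
print(entry["host"], entry["status"], entry["size"], path)
```

The pattern is deliberately simple and would trip over a request or user agent containing an escaped quotation mark, so treat it as a starting point only.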
What Are You Doing With My Data?
The next question is whether I use all this data. The short answer is that I use some of the data, but not all. While web server logs collect a lot of information, that doesn't mean it's accurate or meaningful. I'm primarily concerned with trends and what items I might need to change.
The other important item is that to leverage the log information, webmasters need another application that can parse, sort, filter, aggregate and do lookups. I do use a third-party package to help answer the following types of questions:
Are there any pages that are broken that I need to fix?
I can determine there is a problem from looking at items 5 and 6. This is an important issue since a broken or slow web page is a terrible user experience.
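As a rough idea of what such a check involves, the Python sketch below counts 404 responses per path. The (path, status) pairs are made-up stand-ins for items 5 and 6 of already parsed log lines, and the function name broken_pages is mine.

```python
from collections import Counter

# Made-up (path, status) pairs standing in for items 5 and 6.
requests = [
    ("/articles/a/", 200),
    ("/articles/old-link/", 404),
    ("/articles/old-link/", 404),
    ("/images/logo.gif", 200),
    ("/articles/typo/", 404),
]

def broken_pages(pairs):
    """Count 404 responses per path, most frequent first."""
    misses = Counter(path for path, status in pairs if status == 404)
    return misses.most_common()

print(broken_pages(requests))
```

A path that keeps showing up in this list usually means a stale link somewhere that is worth fixing or redirecting.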
What Are People Reading?
OK, no one should ever be shocked that a webmaster wants to know this information. After all, if you're not reading their content, they don't have a business. It only makes sense that webmasters want to know which articles are read the most and which are read the least.
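Since one page view produces many log lines, a tally of what people read has to ignore the images and style sheets. The Python below is one simple way to sketch that, assuming article addresses end in a slash or .html; that rule, the sample data and the function name page_views are all my own assumptions.

```python
from collections import Counter

# Assumption: only addresses ending like this count as articles.
PAGE_SUFFIXES = ("/", ".html", ".htm")

def page_views(pairs):
    """Tally successful article requests, most read first."""
    counts = Counter(
        path for path, status in pairs
        if status == 200 and path.endswith(PAGE_SUFFIXES)
    )
    return counts.most_common()

# Made-up (path, status) pairs for illustration.
hits = [
    ("/articles/signatures/", 200),
    ("/css/site.css", 200),
    ("/articles/signatures/", 200),
    ("/images/logo.gif", 200),
    ("/articles/macros/", 200),
    ("/articles/signatures/", 200),
    ("/articles/macros/", 200),
    ("/articles/rules/", 200),
]

ranking = page_views(hits)
most_read, least_read = ranking[0], ranking[-1]
print(most_read, least_read)
```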
Hey, are you new to these parts?
As with any business, you like to get new customers and keep the regulars. This is the type of information you can get after accumulating enough daily log files. Even then, the info isn't precise because so many people have dynamic IPs, in which case they appear new to the web server. One way I could circumvent this problem is to force people to register, but I don't.
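The counting itself is simple, which is partly why it is imprecise. Here is a minimal Python sketch that treats an IP address as "returning" if it appeared on any earlier day. The addresses are documentation placeholders and the function name is mine.

```python
def new_and_returning(daily_ips):
    """Given one set of IPs per day (oldest first), return a list of
    (new, returning) counts for each day."""
    seen = set()
    result = []
    for ips in daily_ips:
        result.append((len(ips - seen), len(ips & seen)))
        seen |= ips
    return result

# Placeholder addresses, one set per daily log file.
days = [
    {"192.0.2.1", "192.0.2.2"},
    {"192.0.2.2", "192.0.2.3"},
    {"192.0.2.1", "192.0.2.2", "192.0.2.4"},
]

print(new_and_returning(days))
```

A regular reader whose ISP hands out a different address on each visit would be counted as new every time, and two people sharing a firewall would be counted as one.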
How did you find us?
As you might expect, item 8 can help us in this regard. I look at the referrer information as it indicates where someone posted information about our site or articles. This gives us an opportunity to read what was said on another website and post our comments if needed.
The biggest concern people usually have is seeing their search terms included in a log entry. I can understand this, as I never knew this happened until I looked at a web server log. The search terms are useful because they give me an idea of what type of information people need. These keywords have also helped us with language differences where, as a US-based author, I might use one term, but someone from Europe might use a term or phrase I might not know. Yes, I'm still trying to figure out what the Brits mean by a "punter".
What browsers are people using?
We use item 12 to answer this question. The reason we're interested is that different browsers handle the web code in different ways. While the differences may be subtle, there are times where I have abandoned some features because I couldn't get them to work correctly with a specific browser.
I suppose if I had ample time and budget I would be more proactive with this information. For example, I might offer a reminder to people using older browsers to upgrade as they may be at risk.
The other reason I look at this info is there are certain bots designed to harvest email addresses or images from websites. Since I don't have forums, I don't have to worry about this too much. I still block these agents when appropriate.
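Sorting visitors by browser, and flagging unwanted agents, mostly comes down to looking for telltale words in the user agent string. The Python below is a simplified sketch: the blocked names are hypothetical, the sample strings are illustrative, and real agent strings are messier than this.

```python
from collections import Counter

# Hypothetical agent names to flag; real lists vary and go stale.
BLOCKED_MARKERS = ("exampleharvester", "sampleemailbot")

def classify(agent):
    """Put a user agent string into a rough category."""
    lowered = agent.lower()
    if any(marker in lowered for marker in BLOCKED_MARKERS):
        return "blocked bot"
    if "firefox" in lowered:
        return "Firefox"
    if "msie" in lowered:
        return "Internet Explorer"
    return "other"

# Illustrative agent strings, not taken from a real log.
agents = [
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.7) Firefox/1.0.3",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    "ExampleHarvester/2.0",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
]

tally = Counter(classify(agent) for agent in agents)
print(tally)
```

Keep in mind that the agent string is whatever the client chooses to send, so a badly behaved bot can simply claim to be a regular browser.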
You downloaded how much data?
Many people have the notion that the web is free. Well, this is true if you don't have a website. The truth is that websites have data costs in terms of storage or bandwidth transmission. This is typically set by a contract with a hosting service. If you exceed a contractual term, you pay.
In the majority of cases bandwidth isn't an issue. I'm more than happy to provide content to people and have released articles using the Creative Commons license. After all, the intent of this website is to help people. However, I do draw the line when it becomes apparent people are copying huge chunks of this site for their economic gain.
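Spotting that kind of bulk copying is a matter of adding up item 7 for each client. The Python below is a minimal sketch using made-up (IP, size) pairs; the threshold and function names are mine. A hyphen in the size field means no content was returned, so it is skipped.

```python
from collections import defaultdict

def bytes_by_client(pairs):
    """Sum the bytes sent to each IP; a size of "-" means nothing was sent."""
    totals = defaultdict(int)
    for ip, size in pairs:
        if size != "-":
            totals[ip] += int(size)
    return dict(totals)

def heavy_clients(pairs, limit):
    """Return the IPs whose total download exceeds limit bytes."""
    totals = bytes_by_client(pairs)
    return sorted(ip for ip, total in totals.items() if total > limit)

# Made-up (IP, size) pairs; the addresses are documentation placeholders.
transfers = [
    ("192.0.2.1", "7537"),
    ("192.0.2.9", "250000"),
    ("192.0.2.9", "480000"),
    ("192.0.2.1", "-"),
    ("192.0.2.9", "390000"),
]

print(bytes_by_client(transfers))
print(heavy_clients(transfers, limit=1_000_000))
```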
Browser Cookies Anyone?
I suspect the above information answered some questions about internet privacy and web server logs. Certainly, I can answer questions regarding this site, but I can't speak for other sites. The brilliance of the web is how it is interconnected, but it comes with risks. The downside is that some sites do install spyware or combine server log information with other databases, which can reveal more information about you than you might be aware of. The best defense is to be vigilant about spyware and always read End User License Agreements (EULAs) and Privacy Policies.
Last Updated (Tuesday, 23 August 2011 19:55)