Mozilla

Mozilla Websites, Web Analytics and Privacy

April 9th, 2008

This document discusses the application of web analytics tools to Mozilla websites.

We live in a world of data; we should be thinking carefully about that data and its impact. Many people don’t realize how much information about them is collected by websites and used as a business asset. Some of those who do understand don’t care, or figure there’s no sense talking about it. But a core of the Mozilla community is intensely focused on privacy and the individual person’s ability to understand and control personal information. This has always been the case, and it is part of our strength. These aspects should continue to inform the development of both our software and our websites. With this in mind, I’ve put together a discussion of a particular data-gathering proposal, together with the safeguards that make me comfortable with it.

We would like to understand how people interact with Mozilla’s websites, in particular the consumer-facing websites such as www.mozilla.com, mozilla-europe.org and mozilla-japan.org. To do this we want to implement tools that measure what people do when they visit these sites. These tools are generally known as “web analytics” tools. In particular, we want to implement a product called SiteCatalyst from a company called Omniture for a range of Mozilla websites. The specific sites, the phased rollout plan and the evaluation details are below. Using this service means that data about Mozilla visitors will be processed by Omniture, and will be stored on servers that are not under the direct, physical control of Mozilla. This is new to us and requires consideration of appropriate safeguards. Some wonder if it should even be done. I believe the proposal below is worth trying, and that our arrangement with Omniture includes appropriate safeguards.

Commitments

  • Mozilla will use the web analytics data only to determine aggregate usage patterns for our website. We will not seek to determine personal information from this data.
  • Omniture will use the data from Mozilla websites only to provide and maintain the service for Mozilla; it will not share the information with others or use the information for other purposes. Omniture will not “correlate and report on any Customer Data with any other data collected through other products, services or web properties.”
  • The domain names in Mozilla cookies will clearly identify their affiliation with Mozilla and the Omniture service.
  • We will have public discussions of the results. Before the end of 2008 we will have a public discussion about the benefits (or lack thereof) of using this system.
  • There will be a clear public statement about which web analytic services, if any, are in use with our websites.
  • There will be a public notice and discussion period before including other types of websites, such as developer.mozilla.org and spreadfirefox.com.

Description

One aspect of the Mozilla project that is bigger than many people realize is our website presence. There are actually a number of Mozilla sites. (Or, in industry terms, “website properties.”) There are the development and community-focused sites like developer.mozilla.org and spreadfirefox.com. And then there are the websites that consumers visit, in particular the download, support and services sites: mozilla.com, mozilla-europe.org, and related sites. The latter are significant web presences, causing Mozilla to periodically appear in the list of top 50 most visited websites published by comScore (an Internet measurement firm analogous to Nielsen in the TV space).

1. Our websites act as integral components of our users’ experience. They are also a primary way of communicating with most of our users who aren’t likely to read Planet Mozilla, the newsgroups or other community tools. Today we know very little about how people interact with our websites, in particular the consumer-facing websites. To improve the experience we first need to know some basic data about how users interact with our website properties. We’d like to understand things such as:

  • Is something we think should be easy — like getting from a top-level page to useful add-ons — simple enough for people who aren’t familiar with Mozilla?
  • If we add a landing page with explanations, do people get lost at those pages? Or do these pages help people as we had hoped?
  • How many users successfully find, download, install and become long-term Firefox users?
  • What paths do people take through the website?
  • Is something new (like the dropdown content on the “whatsnew” page) useful to people? How many people see that page and actually click on the links?
  • Do people find the language version of Firefox that fits their location?

2. Each of these websites is large and complex, and each gets an enormous number of visits from general consumers — that is, from people who are not familiar with Mozilla, may not be power users, and whom we can’t claim to understand from our own experiences. Those of us who work on the Mozilla project have — by definition — some familiarity with Mozilla. That is not the case for most of our current 150 or so million users. What feels “easy to use” or comfortable to us could be completely wrong for many people who visit these websites. Furthermore, what might make sense in one language or locale might not be helpful in other languages or cultural contexts.

3. How do we develop a better understanding of how people interact with a website? The basic answer is to gather aggregate data about how people use the website. The term generally used to describe this is “web analytics.” Aggregate data will help us answer the types of questions listed above.
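As a rough illustration of what “aggregate” means here, the sketch below counts page-to-page transitions across visits and keeps only the totals. The visit data, page names and the function name `count_transitions` are invented for the example. This is not a description of how SiteCatalyst works internally; it only shows the kind of summary that could answer a question like “what paths do people take through the website?”

```python
from collections import Counter


def count_transitions(visits):
    """Count page-to-page transitions across all visits.

    `visits` is a list of page sequences, one per visit. Only totals
    are kept; nothing in the result identifies an individual visit.
    """
    totals = Counter()
    for pages in visits:
        # Pair each page with the page viewed immediately after it.
        for source, target in zip(pages, pages[1:]):
            totals[(source, target)] += 1
    return totals


# Hypothetical visits, invented for illustration only.
sample_visits = [
    ["/", "/firefox/", "/download/"],
    ["/", "/add-ons/"],
    ["/", "/firefox/", "/support/"],
]

transitions = count_transitions(sample_visits)
for (source, target), n in transitions.most_common():
    print(f"{source} -> {target}: {n}")
```

The result holds counts per transition, which is enough to see where visitors tend to go next without keeping a record of any single visit.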

4. What techniques are used to instrument a website so that it aggregates data about usage patterns? Two elements are used together to gather data: “cookies” and “web beacons.” A cookie is a string of information that a website stores on a visitor’s computer, and that the visitor’s browser provides to the website each time the visitor returns. Because the browser provides this cookie information to the website at each visit, cookies serve as a sort of label that allows a website to “recognize” a browser when it returns to the site. A “web beacon” is a marker placed in a webpage that makes it easier to follow and record the activities of a recognized browser, such as the path of pages visited at a website.
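To make the two pieces concrete, here is a minimal sketch using only the Python standard library. Everything specific in it is an assumption made for illustration: the cookie name `visitor`, the host `metrics.example.org`, the beacon path `b.gif` and the `page` parameter are hypothetical and are not taken from Omniture’s product. The sketch shows a cookie being issued, a beacon URL being built for a page, and what a collector could recover when the browser requests that beacon and sends the cookie back.

```python
from http.cookies import SimpleCookie
from urllib.parse import parse_qs, urlencode, urlsplit


def make_visitor_cookie(visitor_id, domain):
    """Build the Set-Cookie value a site might send on a first visit."""
    cookie = SimpleCookie()
    cookie["visitor"] = visitor_id          # hypothetical cookie name
    cookie["visitor"]["domain"] = domain
    cookie["visitor"]["path"] = "/"
    return cookie["visitor"].OutputString()


def make_beacon_url(collector, page):
    """Build the URL of a tiny image request that reports one page view."""
    return f"{collector}?{urlencode({'page': page})}"


def read_beacon_hit(url, cookie_header):
    """Recover the (visitor, page) pair a collector sees for one request."""
    page = parse_qs(urlsplit(url).query)["page"][0]
    visitor = SimpleCookie(cookie_header)["visitor"].value
    return visitor, page


# Hypothetical values, for illustration only.
set_cookie = make_visitor_cookie("abc123", "metrics.example.org")
beacon = make_beacon_url("https://metrics.example.org/b.gif", "/firefox/")
hit = read_beacon_hit(beacon, "visitor=abc123")

print(set_cookie)
print(beacon)
print(hit)
```

The cookie is what lets the collector tell that two beacon requests came from the same browser; the beacon is what tells it which page was viewed.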

5. Are there negative things that could happen with this data? As with many kinds of data, yes. It is possible to correlate web analytics data with other data and potentially figure out personal information. Mozilla does not do this and Omniture is not allowed to correlate Mozilla data with any other data to derive personal information.

6. What precisely is Mozilla proposing to do? Use a web analytics product from Omniture called SiteCatalyst to measure interaction with a number of our consumer-facing websites. The proposed rollout of the web analytics is in phases:

  • Phase 1: www.mozilla.com, firefox.com, getfirefox.com, *.mozilla.com. Rollout is pending discussion and feedback on this document. I believe the concerns raised in the newsgroup discussion are addressed, so there may be very little discussion to be had. In that case, the implementation will occur shortly. We would also amend our Privacy Policy as appropriate to describe the storage and processing of this data by a third party.
  • Phase 2: www.mozilla-europe.org, possibly mozilla-japan.org, pending discussion and feedback on this document.
  • Phase 3: Discussion and review period of usefulness of data at the end of 2008.
  • Phase 4: (Pending outcome of Phase 3): add other Mozilla websites such as: addons.mozilla.org, developer.mozilla.org, www.mozilla.org, spreadfirefox.com, planet.mozilla.org; or consider use of a different or additional web analytics program.

7. Isn’t there an open-source or free software version that will do the job? Not that we know of.

8. Why don’t we build our own? This is a significant project in which we have no expertise. We need a solution that works at scale, in a complex, distributed setting, and is available now. That’s a serious project to take on, and one that would certainly take a lot of time and focus. We’d need to build a new community of people that embodies Mozilla DNA and values AND build a world-class piece of software. We’re not experts in analytics or in defining requirements, so we would have to wait until a fair amount of development was done before we could even begin to evaluate how helpful the project was. Those of you who have been around Mozilla since the early days will undoubtedly remember the enormous pain of trying to build the application (in those days the Mozilla Application Suite) before we had a solid infrastructure (the Gecko implementation). The idea of building an analytics package while trying to use it at the same time on websites as complex as those in question is a recipe for disaster.

9. Why Omniture? Omniture has many positive points. The use of the data is limited to providing the web analytics service to Mozilla. The product SiteCatalyst is a widely used solution for large websites; it’s known to scale, be stable, and provide reliable, trustworthy results. Access to the data is highly secured and Omniture provides support resources. In addition, there is a user interface for allowing individuals to opt out of the web analytics processing. There are some drawbacks, of course; there usually are. Omniture is not open source code, which we always prefer. Our arrangement with them is contractual. That’s helpful in that it allows us to include the privacy safeguards in the contract. But as is almost always the case, the complete contract is confidential. Omniture has been criticized for its business practice of using cookies that don’t clearly say they are from Omniture. It turns out Omniture allows its customers to specify whether they want a cookie with the Omniture name in it. Mozilla cookies will do so. And finally, Omniture is not free. Use of Omniture requires payment, unlike other options, and the cost generally rises with the usage of the sites. So it could get expensive and we’ll have to monitor this.

10. How will we evaluate if the data is worth the effort to get it? We’ll look at the results. We have a set of people who are adept at looking at data: Ken, Polvi and Daniel, who just joined us. Ken and Polvi have been publishing what we’ve learned from the data we do have, and we’ll see what can be learned from the additional data. We’ve already moved the data (known as “metrics”) discussions into the public via the Metrics Blog. We will continue to do this.

11. Will Omniture be used with all Mozilla websites? We don’t know yet. As noted above, we’ll do a review of the consumer-facing sites and see how valuable the data is and how we feel about gathering it. We may also look at alternative providers as part of this discussion. Then we can decide about other sites as well, such as our developer- and community-facing websites.

12. Privacy Policy. Our current privacy policy says that Mozilla data won’t go to an outside third party. So it will need amendment to allow for this case. Details on the proposed changes will follow, but for now I’d like to talk through the goals and proposed techniques.

13. Sensitivity to data, privacy and user control. Most websites (and the organizations running them) are unabashed about collecting data, and using that data to improve their business. The use of web analytics is a standard practice, taken for granted by many website operators. This proposal is an extremely mild version. Some people have suggested to me that this discussion is “much ado about nothing” and reflects an extreme focus on privacy by a portion of the Mozilla community. I agree that this is a mild proposal, collecting the most basic of data. But I don’t believe this discussion, or the basic concern, is irrelevant or extreme. As noted above, we live in a world of data; we should be thinking carefully about that data and its impact.

***

Comments welcome here. If you’re interested in the full discussion, head over to the mozilla.org Governance newsgroup. You can also read a set of past comments and participate through the mozilla.governance Google Group.

16 comments for “Mozilla Websites, Web Analytics and Privacy”

  1.

    Sander said on April 9th, 2008 at 8:58 am:

    I’m glad that Google Analytics has been dropped from the original proposal. Having this data go to a third party like Omniture _still_ makes me extremely uncomfortable, but at least I don’t “use” Omniture in other highly privacy-sensitive ways. (Which is not to say that I won’t be blocking all of the tracking requests.)

    As for the lack of open source alternatives to analytics software: I just today became aware of piwik – http://piwik.org/ – which aims to be just that.
    At first glance it looks as if it’s nowhere near the same level yet, but it might provide enough of the basic framework to be worth considering as an alternative; if not now, then at least at that review period at the end of 2008. (And although you rule out setting up something from scratch ourselves, perhaps Mozilla could still support an open source project like this to _become_ a viable alternative.)

  2.

    Chris Jay said on April 9th, 2008 at 8:59 am:

    Why not open up the data generated by the tool? That would provide people with the ultimate confidence that their privacy is not at risk. It would also fit with the Mozilla manifesto, and it would allow the data to be used and analysed by the community in many ways, including perhaps some unexpectedly valuable ones.

  3.

    Ian said on April 9th, 2008 at 9:02 am:

    There are some web analytics tools that work on your server logs instead, though I imagine these are less powerful.

  4.

    Bill Barry said on April 9th, 2008 at 9:24 am:

    I agree with Chris, this data should be provided to the public. Perhaps some algorithm can be run against the IP addresses before it is made public in order to anonymize it a bit, something like:
    maintain the following table somewhere internally (not public)
    ip address
    previous address in traceroute (or something like that which would generalize the address)
    number of times the traceroute address has shown up so far (may not be necessary; I don’t know enough about this yet)

    every time you have an ip address in your log, do a lookup in that table and replace it with the other two columns; when a lookup fails, add a new record to the table.


    I don’t know how feasible this is, but doing this would give the community a whole lot of data which would be very useful.
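One way to read the table scheme suggested in the comment above is sketched below. It is only an interpretation under stated assumptions: the comment proposes generalizing each address to the previous hop in a traceroute, which cannot be reproduced offline, so this sketch swaps in the IPv4 /24 network as the generalized address. The class name `AddressTable`, the prefix length, and the reading of the third column as a per-network running count are all hypothetical choices made for the example.

```python
import ipaddress


class AddressTable:
    """Private lookup table: IP address -> (generalized address, ordinal)."""

    def __init__(self, prefix_len=24):
        self.prefix_len = prefix_len  # assumed stand-in for "previous hop"
        self.records = {}             # ip -> (network, ordinal), kept internal
        self.counts = {}              # network -> distinct addresses seen

    def anonymize(self, ip):
        """Return the public replacement for `ip`, adding a record if new."""
        if ip not in self.records:
            # Lookup failed: generalize the address and add a new record.
            network = str(
                ipaddress.ip_network(f"{ip}/{self.prefix_len}", strict=False)
            )
            self.counts[network] = self.counts.get(network, 0) + 1
            self.records[ip] = (network, self.counts[network])
        return self.records[ip]


# Addresses from documentation-reserved ranges, for illustration only.
table = AddressTable()
first = table.anonymize("192.0.2.17")
second = table.anonymize("192.0.2.99")
repeat = table.anonymize("192.0.2.17")
other = table.anonymize("198.51.100.5")

print(first, second, repeat, other)
```

With this reading, two log lines from the same address map to the same pair, while different addresses in the same network stay distinguishable without the original address appearing in the published data. Whether that is enough to prevent re-identification is a separate question that this sketch does not settle.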

  5.

    Dave Miller said on April 9th, 2008 at 9:37 am:

    To Ian: there are web analytic tools that run on your web logs. And none of them that we’ve found can keep up with the amount of data our logs generate, which is the reason we’re even looking at this. Our websites are generating multiple GIGABYTES of logs every hour, and none of the tools we’ve been able to find can process it as fast as it comes in, even if we throw pretty massive hardware at it.

  6.

    Mitchell Baker said on April 9th, 2008 at 11:52 am:

    Sander: thanks for the pointer; we’ll take a look.

    Chris and Bill: our goal is to be as open as we can. As Bill points out, we can’t just dump the raw data into the public; that could end up in the kind of disclosure people worry about. I don’t know how feasible it is yet to do the sorts of things Bill mentions. I suspect they are much harder at scale and with the kind of reliability we need than it might appear. So part of what we should keep looking at is what can be public safely, and how to improve the setting.

    Mitchell

  7.

    reed said on April 9th, 2008 at 12:11 pm:

    The current privacy policy at http://www.mozilla.com/en-US/privacy-policy.html states “Mozilla sends this information to a third-party service provider to help Mozilla analyze this data. It is possible to link cookies to personally-identifying information, thereby permitting Web site operators, including our third-party analytics provider, to track the online movements of particular individuals.”

    That sure sounds like cookie data is already being sent to a third-party… However, your blog post (section #12) says the exact opposite thing. Which one is correct?

  8.

    Toe said on April 10th, 2008 at 12:17 am:

    The last line of Sander’s comment echoes mine: It would be interesting if Mozilla could take an existing open-source analytics package under its wing to bring it up to the level required. It may not be a ‘Mozilla platform’ project, but then, neither are things like Bugzilla. It might not quite meet the ‘available now’ need, but I don’t think this to be an all-or-nothing thing. Perhaps Omniture (or another system) could be used on the Phase 1/2 sites, while the newly Mozilla-ized open source system is being honed on the Phase 4 sites.

  9.

    Basil Hashem said on April 10th, 2008 at 2:10 pm:

    @Reed: As far as I’m aware, we are not currently sharing any cookie data or web log info with third-parties. I believe that the Dec-07 revision of the policy was done in preparation for using Omniture/Google Analytics. It’s inaccurate. We’ll roll back the privacy policy to pre-Dec07 and have it reflect reality. Thanks.

  10.

    Duane said on April 10th, 2008 at 10:24 pm:

    Because things change all the time I actually set a timeout on the pastebin entry. That said, I did a packet dump a couple of weeks ago on Firefox v3 beta4 (Ubuntu Hardy), and when you run it for the first time or create a new profile, unless you disconnect from the net, there really are some rather disturbing things occurring before you can prevent them.

    http://pastebin.com/m79057aba

    Among other things a cookie is being set from Mozilla’s site until 2038, there is at least one connection to google sites causing it to store a cookie as well, both of these pages were obvious, the javascript tracking bug from mozilla wasn’t though.

    Something I didn’t realise that also happened, since no page or warning or information came up about it, was that Firefox pulls an RSS feed from Mozilla which was redirected to the BBC RSS feed.

    I believe Debian/Ubuntu used to ship a static page, no doubt this was from Mozilla and allowed you to start up the damn browser without worrying about being tracked from the get go by no less than 3 to 4 companies.

    So how much of this blog entry was lip service exactly, because to me it seems like Mozilla can’t give the user data away fast enough.

  11.

    Duane said on April 10th, 2008 at 10:29 pm:

    Oh forgot to mention, because Mozilla no longer seems to be able to protect my privacy, or anyone else’s, I’ve had to resort to treating Firefox as some sort of malware and blocking certain domains, as Firefox keeps phoning home with information after it was told not to in version 2; there was even more in version 3.

    Yes it’s no secret but the sheer arrogance to turn this on by default at least opera has the good decency to nag us to turn it on, not turning it on for us and then telling us how good it was to send all the information to google.

    So I can only recommend people blacklist numerous mozilla.com, mozilla.org and a few google.com domains/hostnames if you care about your privacy at all.

  12.

    Iang said on April 11th, 2008 at 4:39 am:

    Good stuff! It is great to see some brainstorming on the different possibilities. No internal team can see everything and often develops blind spots. Opening up the process, even when highly sensitive, allows all sorts of help to come out. Open governance rocks!

  13.

    Mozilla said on April 14th, 2008 at 7:23 am:

    They have their API open, right? I think a lot of their practices rubbed off on Google

  14.

    Nivash Kumar said on April 16th, 2008 at 5:33 am:

    You guys do not need to have a thought about doing this tracking, just start doing it. Google has been doing this for a decade, on a larger number of people than you are targeting, with their Google Analytics. They offer analytics as a service to web admins, but obviously the results from all sites on the web would also be logged by Google themselves. They say it helps to improve the quality of service. Unless common people are aware of something called cookies you can mine data.

    But do not throw a cookie that expires in 2038 like Google did. When I heard that for the first time I really did not like it. For a corporation like Mozilla, this type of clear objective mentioned before implementing is a good way to convey a message to people that “We’re always a good company” and things “for the benefit of public” like you, Mitchell, always say in conferences. This post shows you are really OPEN. Good luck mining data! 🙂

  15.

    John Francis Lee said on April 17th, 2008 at 7:54 am:

    ‘ …because Mozilla no longer seems to be able to protect my privacy, or anyone else’s I’ve had to resort to treating firefox as some sort of malware… ‘

    I guess you’ve taken a lot of money from Google and that’s why you’ve turned the same corner as they have.

    I have to agree that you must be treated as not being on our side any longer.

  16.

    Nicholas Shiell said on April 20th, 2008 at 3:52 am:

    I DO like Mozilla
    I am cautious about Mozilla’s data mining from their websites.
    But at least they are keeping me in the loop about it.

    (I don’t have to find out about the use of such tools by view-sourcing the web pages to find the beacons there.)

    The web sites are running on Mozilla servers so knowing how people move around the sites can be useful to them.

    Mitchell, it’s nice to be kept in the loop – thanks
