Tuesday, July 29, 2008

Wikipedia's Kittens

Just found a great blog


"....a generalized unit of contributor motivation called a kitten.

1 kitten = the amount of motivation needed to get 1 person to spend 1 minute trying to improve an article

We can say, quite literally, that Wikipedia runs on kittens. In fact, entrepreneurs discover this every day when they try to start a "crowdsourcing" site and nobody shows up. So, what generates kittens? Foremost, it's the possibility of someone else learning from what you wrote -- not just immediately, but at any time in the future"

Kittens are born when there is a perception that the words one writes will survive for some time. Something like this should hold:

Number of views on a given day = (Number of views per day) × (Chance of surviving one day) ^ (Number of days that have passed)
With a 1-in-ten-thousand chance of being destroyed each day, the article will rack up exactly seven million views over its lifetime.

Like the author says - that's a LOT of kittens!
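The arithmetic in the quote can be checked with a quick sketch. Note that the post does not state the assumed daily view count; 700 views/day is a hypothetical figure, chosen only because it reproduces the seven-million total.

```python
# Expected lifetime views of an article destroyed with probability
# 1/10,000 on any given day.
# The daily view count is NOT given in the post; 700 views/day is an
# assumed figure that happens to reproduce the "seven million".

views_per_day = 700            # assumed, not from the post
p_destroyed = 1 / 10_000
survival = 1 - p_destroyed

# Total views = sum over days n of views_per_day * survival**n,
# a geometric series summing to views_per_day / (1 - survival).
lifetime_views = views_per_day / (1 - survival)
print(round(lifetime_views))   # about seven million
```

In other words, lifetime views are just daily views divided by the daily chance of destruction, which is why a tiny survival improvement is worth so many kittens.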

Friday, July 25, 2008

Why Wikipedia Succeeded

Larry Sanger, Wikipedia's cofounder, gives his take on why Wikipedia succeeded.
Although rather old (2005), the feature has some great insights.


In short, these are the factors:
  1. Open content license. We promised contributors that their work would always remain free for others to read. This, as is well known, motivates people to work for the good of the world--and for the many people who would like to teach the whole world, that's a pretty strong motivation.
  2. Focus on the encyclopedia. We said that we were creating an encyclopedia, not a dictionary, etc., and we encouraged people to stick to creating the encyclopedia and not use the project as a debate forum.
  3. Openness. Anyone could contribute. Everyone was specifically made to feel welcome. (E.g., we encouraged the habit of writing on new contributors' user pages, "Welcome to Wikipedia!" etc.) There was no sense that someone would be turned away for not being bright enough, or not being a good enough writer, or whatever.
  4. Ease of editing. Wikis are pretty easy for most people to figure out. In other collaborative systems (like Nupedia), you have to learn all about the system first. Wikipedia had an almost flat learning curve.
  5. Collaborate radically; don't sign articles. Radical collaboration, in which (in principle) anyone can edit any part of anyone else's work, is one of the great innovations of the open source software movement. On Wikipedia, radical collaboration made it possible for work to move forward on all fronts at the same time, to avoid the big bottleneck that is the individual author, and to burnish articles on popular topics to a fine luster.
  6. Offer unedited, unapproved content for further development. This is required if one wishes to collaborate radically. We encouraged contributors to put up their unfinished drafts--as long as they were at least roughly correct--with the idea that they can only improve if there are others collaborating. This is a classic principle of open source software. It helped get Wikipedia started and helped keep it moving. This is why so many original drafts of Wikipedia articles were basically garbage (no offense to anyone--some of my own drafts were sometimes garbage), and also why it is surprising to the uninitiated that many articles have turned out very well indeed.
  7. Neutrality. A firm neutrality policy made it possible for people of widely divergent opinions to work together, without constantly fighting. It's a way to keep the peace.
  8. Start with a core of good people. I think it was essential that we began the project with a core group of intelligent good writers who understood what an encyclopedia should look like, and who were basically decent human beings.
  9. Enjoy the Google effect. We had little to do with this, but had Google not sent us an increasing amount of traffic each time they spidered the growing website, we would not have grown nearly as fast as we did. (See below.)

Thursday, July 24, 2008

Defining Knowledge

Came across this quite elucidating (yes, I've used the word "elucidating"!) definition and explanation of Knowledge...


    • A collection of data is not information.
    • A collection of information is not knowledge.
    • A collection of knowledge is not wisdom.
    • A collection of wisdom is not truth.

The idea is that information, knowledge, and wisdom are more than simply collections. Rather, the whole represents more than the sum of its parts and has a synergy of its own.

We begin with data, which is just a meaningless point in space and time, without reference to either space or time. It is like an event out of context, a letter out of context, a word out of context. The key concept here being "out of context." And, since it is out of context, it is without a meaningful relation to anything else. When we encounter a piece of data, if it gets our attention at all, our first action is usually to attempt to find a way to attribute meaning to it. We do this by associating it with other things. If I see the number 5, I can immediately associate it with cardinal numbers and relate it to being greater than 4 and less than 6, whether this was implied by this particular instance or not. If I see a single word, such as "time," there is a tendency to immediately form associations with previous contexts within which I have found "time" to be meaningful. This might be, "being on time," "a stitch in time saves nine," "time never stops," etc. The implication here is that when there is no context, there is little or no meaning. So, we create context but, more often than not, that context is somewhat akin to conjecture, yet it fabricates meaning.

Monday, July 21, 2008

Model for Viral Growth

One of the things on my mind is to create a sufficiently accurate model to predict viral growth. While digging, I came across this very interesting blog post, with a link here:

Some others
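As a starting point for such a model, here is a minimal sketch of the standard viral-coefficient (K-factor) idea. This is not from the posts linked above; every number in it is illustrative.

```python
# Minimal viral-growth sketch: each new user sends `invites` invitations,
# of which a fraction `conversion` convert into new users. The viral
# coefficient K = invites * conversion; K > 1 means compounding growth,
# K < 1 means the user base plateaus. All parameters are illustrative.

def viral_growth(initial_users, invites, conversion, cycles):
    k = invites * conversion       # viral coefficient K
    total = initial_users
    new_users = initial_users      # only fresh users send invites
    history = [total]
    for _ in range(cycles):
        new_users = new_users * k
        total += new_users
        history.append(total)
    return history

# K = 1.2: compounding growth
print(viral_growth(100, 6, 0.2, 5)[-1])
# K = 0.5: growth stalls; the total can never exceed 2x the seed users
print(viral_growth(100, 2, 0.25, 50)[-1])
```

The plateau in the K < 1 case is a geometric series, which is why "sufficiently accurate" prediction hinges almost entirely on estimating K well.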

Friday, May 16, 2008

Watson and Crick

On Feb. 28, 1953, Francis Crick walked into the Eagle pub in Cambridge, England, and, as James Watson later recalled, announced that "we had found the secret of life."....

Slightly off-topic, but worth it for the 1959 picture alone

Full TIME article here.

Tracking Memes in the Infosphere

Infosphere is a term used since the 1990s to speculate about the common evolution of the Internet, society and culture. It is a neologism composed of information and sphere. More about its origins here.
The difficulties with memetics are many, and it has been bogged down in controversy for a long time. One of the main problems is how to isolate a "meme". What IS a meme, anyway?
Start here, but don't expect to find a definite answer - it's not there yet!
The whole discipline (if it can be called that!) is in a similar state to that of genetics in the 1950s. What was a gene? It took Watson and Crick to come up with the molecular structure - the double helix - of DNA, before genetics really took off.
The meme sounds very vague when defined as "a unit of cultural information". An abstract but precise mathematical notion is required - perhaps it can be found in information theory?
So, without even knowing precisely what a meme is, how are we supposed to track them, build theories around them, and maybe even try and predict stuff with them?

The meme-tracking problem.....
Some possible routes:
Web publication volume and search trends

Hitwise tracks search data of all major search engines, including Google.
Google Trends also tells us the history of search volume on keywords, i.e. how many searches were executed on these keywords over time. This sounds like a good indicator of what's "hot". I am not entirely sure it is a truly accurate indicator of "meme" though. For example, Breaking News of any kind will cause a peak in News Coverage - But News is not Meme!

Published Reports by Professional Market Research Firms:
E.g. the Harris Interactive Annual RQ™ study, conducted yearly since 1999, assesses the reputation of the 60 most visible companies in the United States, as perceived by the general public. Changes in reputation are what we want to learn. Consumers' perceptions of 'brand' are one part of it, of course - but again, this is not quite enough, or good enough, data.

WOM data: 'Positive word of mouth' data can be sourced from places like Keller and Fay's Talk-Track, a research service that tracks consumer conversations via a weekly survey sample of 700 consumers aged 13+. Online brand mentions data can be sourced from a service like Nielsen Buzzmetrics. It searches the net for mentions of specific words or phrases on discussion boards, blogs or other places where consumers communicate online.

Advertising Spending: Weekly advertising spending data for television and national magazines can be had from Nielsen's Monitor + database. Online advertising spending can be obtained from AdRelevance, owned by Nielsen Netratings.

Agencies like ComScore MediaMetrix track website visits through a representative panel of 2 million users - another valuable storehouse of data but not "ready made" for meme-tracking by any means.

One problem for any non-US study is the possible difficulty in getting location-specific non-US data.

Research designs incorporating open-ended, discovery-oriented in-depth interviews are another option, with the obvious limitation of being impossibly difficult to scale up, or even trust.

The meme-tracking space is supposed to be HOT round about now...
It’s not easy to define this space.....but as Alex Barnett says, these are not meme-trackers - there seems to be no real meme-tracker around. I agree.
"Memeorandum, Megit and Chuquet are not 'meme' trackers. They are news trackers. Or tittle-tattle trackers. Or gossip trackers. Again, generally speaking, there are no 'memes' being tracked at these sites". I especialy like his comment that "The idea that these are 'memetrackers' is actually quite a good example of a meme."
Which brings us back to the question: How does one define a meme, at least in a way that a bot can measure it? (I think if you can define something such that a machine can understand it, then you have done a good job of the definition!)

Some interesting papers, using innovative means to find and interpret data can be found in the Journal of Advertising Research, December 2007. I am going to fish around for those again.

Hint : Technology Memes

While learning quite a lot about Project Management from DavidT, I quite accidentally ran across his post on adoption.

"There are a couple of largely accepted theories that model or predict technology lifecycle and adoption patterns:- The Diffusion of Innovations theory offers a model for how a given technology gets accepted and spreads through markets. Its central point is that technologies spread by gradually addressing the needs of 4 types of users: innovators, early adopters, the early majority, and the late majority (a fifth category, the laggards, might just never get it)- The Technology Acceptance Model (TAM) offers some prediction to End User adoption. The key concept here is that individual users adopt a given technology based on its perceived usefulness and its perceived ease of use.To my knowledge, there isn't an established theory or framework that models evolution trends of Technologies.When looking at the history and evolution of web services, we seem to be in front of species that are spreading, adapting, and diverging much like finches in the Galapagos.The immediate thought that then comes to mind is whether Darwin's Theory of Evolution has some or any relevance to Technology.The theory of evolution defines three basic mechanisms of evolutionary change:. Natural Selection is a process by which traits that are more useful in a given environment become more common over time (because they give better chances of survival), while traits that are harmful become rarer. Gene Flow is the exchange of genes within and between populations, which translates in traits being transferred between populations and species.. 
Genetic Drift is a purely random shift of the frequency of traits within a population - traits become more or less common in a population because of the long-term statistical effect of the random distribution of genes in each generationHow could these mechanisms apply to technology?- Natural Selection is probably the mechanism most relevant to technology trending.The fitter a technology is to the needs of its market, the more likely it is to stick around, and potentially supersede other technologiesThis is why PCs are more likely to be found today than mainframes, why java is more often used than Fortran, and why soap-based web services have replaced xml-rpc.- Gene Flow is also common in the tech field (although we'd probably want to call it something else).Features and concepts are constantly exchanged between complementary or competing technologies.That's how C# got a memory garbage collection mechanism similar to the one in java, and how row-level locking made it in MS SQL Server after years of Oracle claiming it as a key differentiator.Gene flow is also at the root of hybridization, where traits of different species end-up being combined. This is what might be truly going on right now with REST - which is applying concepts of simpler web protocols, most notably HTTP and RSS, onto Web Services.- Genetic Drift seems at first least relevant to the tech field, but might in fact be the most interesting bit.The core concept in genetic drift is that the random distribution of genes in each generation can have a long-term effect on the frequency of traits in a population (because of the statistical law of large numbers, genetic drift is less likely to occur in large population than in smaller ones).What, if anything, could have a similar impact in the evolution of technologies? 
What type of mechanisms, if any, can have an effect on the evolution and adoption of a technology, without being connected to its intrinsic fit or value?Obviously there are a lot more forces that dictate the success or demise of technologies than just their core virtues.A strategic alliance with IBM propelled MS-DOS into market dominance; technology companies like Oracle spend millions trying to influence the market; and there is a whole ecosystem of media, analysts, and venture capitalists who strive on generating buzz (PointCast or Twitter come to mind).Who knows - if LISP had been able to be more hip, we might all be using more parenthesis today."

Again, my "meme theory" interest (obsession?) means I cannot but help notice 'one more case that fits'. Technology memes playing out their game of survival in the world....
The question is: How can I model, simulate, and more importantly - validate, prove....and Predict the future?
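One concrete starting point for simulating the Diffusion of Innovations pattern quoted above is the Bass diffusion model, a standard model from the marketing literature (it is not mentioned in the post; using it here is my own assumption). A minimal sketch, with textbook-style illustrative coefficients:

```python
# Bass diffusion model: dF/dt = (p + q*F) * (1 - F), where F is the
# fraction of the market that has adopted, p captures "innovators"
# (adoption independent of others) and q captures "imitators" (adoption
# driven by existing adopters). Coefficients are illustrative, not
# fitted to any real technology.

def bass_adoption(p=0.03, q=0.38, steps=60, dt=1.0):
    f = 0.0
    curve = [f]
    for _ in range(steps):
        f += (p + q * f) * (1 - f) * dt   # simple Euler step
        curve.append(f)
    return curve

curve = bass_adoption()
# Adoption starts slowly, accelerates as imitation kicks in, then
# saturates as the remaining market (1 - F) shrinks: the classic S-curve.
```

Fitting p and q to real adoption data (and watching how they drift per technology) would be one way to start on the "model, simulate, validate" question.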

Thursday, May 15, 2008

Learning about the Social Networks behind Wiki

Reference: Korfiatis, Poulos and Bokos, “Evaluating authoritative sources using social networks: an insight from Wikipedia”, Online Information Review Vol. 30 No. 3, 2006 pp. 252-262

Two Layers of Network

(1) The articles network.
Every article in Wikipedia contains references to other articles, as well as external references. A set of links used for classification purposes is also available in most of the active articles of the encyclopedia. Every article represents a vertex in the article network, and the internal links between articles represent the edges of the network.

(2) The contributors network.

Wikipedia is a collaborative writing effort, which means that an article has multiple contributors. We assume that a contributor establishes a relationship with another contributor if they work on the same article. In the resultant signed network, a vertex represents each contributor, and their social ties (positive or negative) are represented by an edge denoting the sequence of their social interaction.

Visualization of the social network of contributors behind the article "Immanuel Kant"
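The contributor-network construction in (2) can be sketched in a few lines: link two contributors whenever they edited the same article. The edit log below is invented toy data, not real Wikipedia history, and the sketch ignores the sign (positive/negative) of the ties.

```python
# Build an undirected co-editing network from (article, contributor)
# pairs: contributors who touched the same article get an edge.
from collections import defaultdict
from itertools import combinations

edits = [                       # toy data, not real edit history
    ("Immanuel Kant", "alice"),
    ("Immanuel Kant", "bob"),
    ("Immanuel Kant", "carol"),
    ("Epistemology",  "bob"),
    ("Epistemology",  "dave"),
]

by_article = defaultdict(set)
for article, user in edits:
    by_article[article].add(user)

edges = set()                   # undirected co-editing ties
for users in by_article.values():
    for u, v in combinations(sorted(users), 2):
        edges.add((u, v))

print(sorted(edges))
# "bob" edits both articles, so he bridges the two editor groups
```

Real edit histories could be pulled from Wikipedia's public dumps; adding edge signs would require classifying the interactions (e.g. reverts vs. additions), which is where the paper's real work lies.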

...And of course, I had already mentioned Chris Harrison's WikiViz Project. Talk about beautiful graphs!

You can also have a look at the Clusterball Project. Don't miss the movie.

10 Questions, with Jimmy Wales

This might be an oldish article, but I thought this might be a good time to dig it up.


Jimmy Wales' answer to "Why do people contribute?" is especially interesting:

"...It's realizing that doing intellectual things socially is a lot of fun—it makes sense. We don't plan on paying people, either, to contribute. People don't ask, "Gosh, why are all these people playing basketball for fun? Some people get paid a lot of money to do that."

He also says "It turns out that people aren't as horrible as the Internet made them seem for a while."

All the Interesting Questions about Wikipedia on One Page

Just quickly, off the top of my head, I can think of these:

How fast will Wikipedia continue to grow in the near and far future?
Is there a limit to the growth (as per the logistic growth model)?
(...Or is "To know all is not permitted" !!?)
How will Quality of Content be affected in the near and far future, as Wikipedia grows?

Is Wikipedia truly Reliable? Would you bet your life on a Fact from Wikipedia?
Can the regulatory mechanism be improved? How?
Do "Editing Wars" always lead to the "unbiased Truth"? (Can the Truth itself oscillate?!)
Does the process of "finding consensus" always lead to the best entry? (Is the "Average" always the right answer? Are there cases where the 'populist decision' may not be the 'right one' ?)

What motivates Contributors? (The question "What motivates Seekers?" is fairly trivial)
What is the "carrot" at the end of the stick for contributors?
Will a contributor contribute content that has a high personal cost (or opportunity cost!) associated with sharing? (e.g: A stock trader chancing upon and then disclosing a piece of positive/negative news about a listed firm before it has broken on any other news channel, and without profiting personally, Or an inventor publishing on Wikipedia without thought of personal gain from his/her invention)

(I use the term very loosely here, as the "sum total of all human social knowledge".)
Can one uncover hidden cultural facets by studying the topography of the wikipedia network, by observing clusters, by deducing from "association" something of value? (somehow, again, I think - "Meme Theory"!)
What is the contribution of Wikipedia to Culture?

Wednesday, May 14, 2008

Computational Trust in Web Content Quality

Interesting points I found in Pierpaolo Dondio and Stephen Barrett, "Computational Trust in Web Content Quality: A Comparative Evaluation on the Wikipedia Project", Informatica 31 (2007) 151–160

The problem of identifying useful and trustworthy information on the World Wide Web is becoming increasingly acute as new tools such as wikis and blogs simplify and democratize publication. It is not hard to predict that in the future the direct reliance on this material will expand and the problem of evaluating the trustworthiness of this kind of content will become crucial. The Wikipedia project represents the most successful and discussed example of such online resources. In this paper we present a method to predict Wikipedia articles' trustworthiness based on computational trust techniques and a deep domain-specific analysis. Our assumption is that a deeper understanding of what in general defines high standards and expertise in domains related to Wikipedia - i.e. content quality in a collaborative environment - mapped onto Wikipedia elements would lead to a complete set of mechanisms to sustain trust in the Wikipedia context. We present a series of experiments. The first is a case study of a specific category of articles; the second is an evaluation over 8,000 articles, representing 65% of the overall Wikipedia editing activity. We report encouraging results on the automated evaluation of Wikipedia content using our domain-specific expertise method. Finally, in order to appraise the value added by using domain-specific expertise, we compare our results with the ones obtained with a pre-processed cluster analysis, where complex expertise is mostly replaced by training and automatic classification of common features.

I thought this interesting:
Ciolek, T., "Today's WWW, Tomorrow's MMM: The specter of multi-media mediocrity", IEEE Computer, Vol. 29(1), pp. 106-108, January 1996.
This paper predicted a seriously negative future for online content quality by describing the World Wide Web (WWW) as "a nebulous, ever-changing multitude of computer sites that house continually changing chunks of multimedia information, the global sum of the uncoordinated activities of several hundreds of thousands of people".

.....On one hand, recent exceptional cases have brought the question of Wikipedia's trustworthiness to wide attention. In an article published on the 29th of November in USA Today, Seigenthaler, a former administrative assistant to Robert Kennedy, wrote about his anguish after learning about a false Wikipedia entry that listed him as having been briefly suspected of involvement in the assassinations of both John Kennedy and Robert Kennedy. The 78-year-old Seigenthaler got Wikipedia founder Jimmy Wales to delete the defamatory information in October. Unfortunately, that was four months after the original posting. The news was further proof that Wikipedia has no accountability and no place in the world of serious information gathering.

How much do you trust Wikipedia? (March 2006)
In December 2005, a detailed analysis carried out by the magazine Nature compared the accuracy of Wikipedia against the Encyclopaedia Britannica. Nature identified a set of 42 articles, covering a broad range of scientific disciplines, and sent them to relevant experts for peer review. The results are encouraging: the investigation suggests that Britannica's advantage may not be great, at least when it comes to science entries. The difference in accuracy was not particularly great: the average science entry in Wikipedia contained around four inaccuracies; Britannica, about three. Reviewers also found many factual errors, omissions or misleading statements: 162 and 123 in Wikipedia and Britannica respectively.
Giles, J., "Internet encyclopaedias go head to head", Nature, Vol. 438, 15 December 2005.

“trust is a subjective assessment of another’s influence in terms of the extent of one’s perceptions about the quality and significance of another’s impact over one’s outcomes in a given situation, such that one’s expectation of, openness to, and inclination toward such influence provide a sense of control over the potential outcomes of the situation.” - Romano
Computational trust was first defined by S. Marsh as a new technique able to make agents less vulnerable in their behaviour in a computing world that appears to be malicious rather than cooperative, and thus to allow interaction and cooperation where previously there could be none.
Ziegler and Golbeck studied an interesting correlation between similarity and trust among social network users: there is some indication that similarity may be evidence of trust.
The most visited and edited articles reach an average editing rate of 50 modifications per day... "Speed" is one of the requirements that conventional techniques do not match up to.
In general, user past-experience with a Web site is only at 14th position among the criteria used to assess the quality of a Web site, with an incidence of 4.6%. We conclude that a mechanism to evaluate articles' trustworthiness relying exclusively on their present state is required.
Alexander identified three basic requirements: objectivity, completeness and pluralism. The first requirement guarantees that the information is unbiased, the second assesses that the information should not be incomplete, the third stresses the importance of avoiding situations in which information is restricted to a particular viewpoint.

Modeling Wikipedia

I won't elaborate on their experiment in detail, but jump straight to the conclusion.
They claim to have proposed a transparent, noninvasive and automatic method to evaluate the
trustworthiness of Wikipedia articles. The method was able to estimate the trustworthiness of articles relying only on their present state, a characteristic needed in order to cope with the changing nature of Wikipedia.

Is Wikipedia TrustWorthy?

Some people, especially academics, are uncomfortable with Wikipedia as a "source" of knowledge. Notwithstanding the regulatory and control mechanisms to prevent "vandalism" of content, there is still skepticism among most academics about how far Wikipedia can be a trustworthy resource.
In COMMUNICATIONS OF THE ACM September 2007/Vol. 50, No. 9, Neil L. Waters explains "Why You Can’t Cite Wikipedia in My Class"
A recent post on SlashDot quotes another IT professor saying:
"People are unwittingly trusting the information they find on Wikipedia, yet experience has shown it can be wrong, incomplete, biased, or misleading"

There was an interesting case recently of a "circular reference" created by Wikipedia. "Ali G", claimed a Wikipedia entry, had worked for Goldman Sachs. No sources were given. This found its way into a popular mainstream media journal, and Wikipedia became a reference to itself!

Where does that leave us?
I'll leave you with a quote from http://tech.slashdot.org/comments.pl?sid=521670&cid=23103370 (reference link above)
The real Wiki-vandals are the companies, governments and lobby groups of all sorts that flood Wikipedia with their squeaky clean corporate profiles (yes, corporate governments), whipped straight from their websites … These entities are the true threat to the laudable goal of Wikipedia to provide a freely accessible forum for the production and storage of (hopefully well-referenced) articles for the masses, and a forum that does not restrict the privilege of contribution to those that have jumped through all the right hoops. … The printed word is no more reliable than the plasma. Lies may be propagated on Wikipedia, but not without debate. Politicians spouting their sludge find their propaganda sitting side-by-side with those that mock them… If, knowing that anything in a Wikipedia article is as likely to be crap as correct, the average reader becomes more vigilant in clicking through to the supporting sources, then Wikipedia has served the purpose of bringing to the masses the healthy skepticism that is, after all, the cornerstone of all academic pursuits. Dark eyes look down from ivory towers. Do they cheer or do they fear?

Visualizing Wikipedia

Chris Harrison, over at the Human-Computer Interaction Institute at Carnegie Mellon University, has some beautiful visualizations of Wikipedia's network structure. He calls this project WikiViz.
Apart from the stunning visuals, I am thankful to Chris for two ideas that came to me:
- Can Visualizations in the form of Graphs of Wikipedia be analyzed from the point of view of Meme Theory?
- Can I use GraphViz (or similar tools) to develop visualizations for concepts in Religious Texts?
Chris has, in another interesting project, undertaken a visualization of the social network present in the Bible (http://www.chrisharrison.net/projects/bibleviz/index.html).
Similar visualizations for the Gita and Quran should be possible. What would be the motive?
Some vague thoughts swimming in my mind at this stage: Memes, Clusters, Selective Pressures, Evolutionary theories of Culture.

Modeling Wikipedia's Growth

Wikipedia is one of the most interesting phenomena of current times.

One interesting area of study is the modeling of its growth.

If Wikipedia's growth followed the exponential growth model, the average rate of growth would be proportional to the size of Wikipedia. However, it appears that the rate of growth is slowing. Maybe Wikipedia's growth follows the logistic growth model better. This model is based on:
- more content leads to more traffic, which in turn leads to more new content
- however, more content also leads to less potential content, and hence less new content
- the limit is the combined expertise of the possible participants.
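The three bullet points above map directly onto the logistic equation: growth proportional both to current size and to remaining headroom. A minimal sketch; the growth rate and carrying capacity below are illustrative values, not fitted to Wikipedia data.

```python
# Logistic growth: dN/dt = r * N * (1 - N/K).
# The r*N factor is "more content -> more traffic -> more new content";
# the (1 - N/K) factor is "more content -> less potential content left".
# r and K here are illustrative, NOT fitted to Wikipedia statistics.

def logistic(n0, r, K, steps):
    n = n0
    sizes = [n]
    for _ in range(steps):
        n += r * n * (1 - n / K)    # simple Euler step
        sizes.append(n)
    return sizes

sizes = logistic(n0=10_000, r=0.05, K=3_000_000, steps=400)
# Early on this is indistinguishable from exponential growth at rate r;
# later it flattens out toward the carrying capacity K.
```

Distinguishing the two models empirically is hard precisely because the early logistic curve looks exponential; only the slowdown reveals K.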

Interestingly enough, while the number of articles may not be strictly following the exponential curve, we may consider that the quality of articles is - i.e., if we assume that the number of edits per article is a measure of its quality.
The graph is plotted on a logarithmic scale, and this data also fits well with exponential growth starting from October 2002. The number of edits per article has since doubled once every 504 days.
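As a quick sanity check on that doubling figure, assuming clean exponential growth:

```python
# If edits per article grow exponentially and double every 504 days,
# the implied continuous daily growth rate is ln(2)/504.
import math

doubling_days = 504
daily_rate = math.log(2) / doubling_days   # roughly 0.14% per day
# After one doubling period the multiplier should be exactly 2:
growth = math.exp(daily_rate * doubling_days)
print(round(growth, 6))
```

That tiny daily rate is why the trend only shows up clearly on a logarithmic plot over several years.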