Myshkin's travel blog: May 2008

Friday, May 16, 2008

Watson and Crick

On Feb. 28, 1953, Francis Crick walked into the Eagle pub in Cambridge, England, and, as James Watson later recalled, announced that "we had found the secret of life."....

Slightly off-topic, but worth it for the 1959 picture alone

Full TIME article here.

Tracking Memes in the Infosphere

Infosphere is a term used since the 1990s to speculate about the common evolution of the Internet, society and culture. It is a neologism composed of information and sphere. More about its origins here.
The difficulties with memetics are many, and it has been bogged down in controversy for a long time. One of the main problems is how to isolate a "meme": what IS a meme, anyway?
Start here, but don't expect to find a definite answer - it's not there yet!
The whole discipline (if it can be called that!) is in a similar state to that of genetics in the 1950s. What was a gene? It took Watson and Crick to come up with the molecular structure - the double helix - of DNA, before genetics really took off.
The meme sounds very vague when defined as "a unit of cultural information". An abstract but precise mathematical notion is required - perhaps it can be found in information theory?
So, not even knowing precisely what a meme is, how are we supposed to track them, and build theories around them, and maybe even try and predict stuff with them?

The meme-tracking problem.....
Some possible routes:
Web publication volume and search trends

Hitwise tracks search data of all major search engines, including Google.
Google Trends also tells us the history of search volume on keywords, i.e. how many searches were executed on these keywords over time. This sounds like a good indicator of what's "hot". I am not entirely sure it is a truly accurate indicator of "meme" though. For example, Breaking News of any kind will cause a peak in News Coverage - But News is not Meme!

Published Reports by Professional Market Research Firms:
E.g., the Harris Interactive Annual RQ™ study, conducted yearly since 1999, assesses the reputation of the 60 most visible companies in the United States, as perceived by the general public. Changes in reputation are what we want to learn. Perceptions of 'brand' by consumers are one part of it, of course - but again, this is not quite enough or good enough data.

WOM data: 'Positive word of mouth' data can be sourced from places like Keller and Fay's Talk-Track, a research service that tracks consumer conversations via a weekly survey sample of 700 consumers aged 13+. Online brand mentions data can be sourced from a service like Nielsen Buzzmetrics. It searches the net for mentions of specific words or phrases on discussion boards, blogs or other places where consumers communicate online.

Advertising Spending: Weekly advertising spending data for television and national magazines can be had from Nielsen's Monitor + database. Online advertising spending can be obtained from AdRelevance, owned by Nielsen Netratings.

Agencies like ComScore MediaMetrix track website visits through a representative panel of 2 million users - another valuable storehouse of data but not "ready made" for meme-tracking by any means.

One problem for any non-US study is the possible difficulty in getting location-specific non-US data.

Research designs incorporating open-ended, discovery-oriented in-depth interviews are another option, with the obvious limitation of being impossibly difficult to scale up, or even trust.

The meme-tracking space is supposed to be HOT round about now...
http://www.techcrunch.com/2006/02/04/a-look-at-the-memeorandum-killers/
It’s not easy to define this space.....but as Alex Barnett says, these are not meme-trackers - there seems to be no real meme-tracker around. I agree.
"Memeorandum, Megit and Chuquet are not 'meme' trackers. They are news trackers. Or tittle-tattle trackers. Or gossip trackers. Again, generally speaking, there are no 'memes' being tracked at these sites". I especially like his comment that "The idea that these are 'memetrackers' is actually quite a good example of a meme."
Which brings us back to the question: How does one define a meme, at least in a way that a bot can measure? (I think if you can define something so that a machine can understand it, then you have done a good job at the definition!)

Some interesting papers, using innovative means to find and interpret data can be found in the Journal of Advertising Research, December 2007. I am going to fish around for those again.

Hint : Technology Memes

While learning quite a lot about Project Management from DavidT, I quite accidentally ran across his post on adoption.

"There are a couple of largely accepted theories that model or predict technology lifecycle and adoption patterns:
- The Diffusion of Innovations theory offers a model for how a given technology gets accepted and spreads through markets. Its central point is that technologies spread by gradually addressing the needs of 4 types of users: innovators, early adopters, the early majority, and the late majority (a fifth category, the laggards, might just never get it)
- The Technology Acceptance Model (TAM) offers some prediction to End User adoption. The key concept here is that individual users adopt a given technology based on its perceived usefulness and its perceived ease of use.

To my knowledge, there isn't an established theory or framework that models evolution trends of Technologies. When looking at the history and evolution of web services, we seem to be in front of species that are spreading, adapting, and diverging much like finches in the Galapagos. The immediate thought that then comes to mind is whether Darwin's Theory of Evolution has some or any relevance to Technology.

The theory of evolution defines three basic mechanisms of evolutionary change:
- Natural Selection is a process by which traits that are more useful in a given environment become more common over time (because they give better chances of survival), while traits that are harmful become rarer.
- Gene Flow is the exchange of genes within and between populations, which translates in traits being transferred between populations and species.
- Genetic Drift is a purely random shift of the frequency of traits within a population - traits become more or less common in a population because of the long-term statistical effect of the random distribution of genes in each generation

How could these mechanisms apply to technology?
- Natural Selection is probably the mechanism most relevant to technology trending. The fitter a technology is to the needs of its market, the more likely it is to stick around, and potentially supersede other technologies. This is why PCs are more likely to be found today than mainframes, why java is more often used than Fortran, and why soap-based web services have replaced xml-rpc.
- Gene Flow is also common in the tech field (although we'd probably want to call it something else). Features and concepts are constantly exchanged between complementary or competing technologies. That's how C# got a memory garbage collection mechanism similar to the one in java, and how row-level locking made it in MS SQL Server after years of Oracle claiming it as a key differentiator. Gene flow is also at the root of hybridization, where traits of different species end-up being combined. This is what might be truly going on right now with REST - which is applying concepts of simpler web protocols, most notably HTTP and RSS, onto Web Services.
- Genetic Drift seems at first least relevant to the tech field, but might in fact be the most interesting bit. The core concept in genetic drift is that the random distribution of genes in each generation can have a long-term effect on the frequency of traits in a population (because of the statistical law of large numbers, genetic drift is less likely to occur in large population than in smaller ones). What, if anything, could have a similar impact in the evolution of technologies? What type of mechanisms, if any, can have an effect on the evolution and adoption of a technology, without being connected to its intrinsic fit or value?

Obviously there are a lot more forces that dictate the success or demise of technologies than just their core virtues. A strategic alliance with IBM propelled MS-DOS into market dominance; technology companies like Oracle spend millions trying to influence the market; and there is a whole ecosystem of media, analysts, and venture capitalists who strive on generating buzz (PointCast or Twitter come to mind). Who knows - if LISP had been able to be more hip, we might all be using more parenthesis today."
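The remark in the quoted post that genetic drift matters less in large populations than in small ones can be illustrated with a tiny simulation. What follows is only my own rough sketch of a Wright-Fisher-style resampling model, not anything from DavidT's post. The function names (`simulate_drift`, `mean_abs_shift`), the population sizes and the seed are all made up for illustration. Read "trait" as "a technology feature" and "population" as "the set of adopters" if you like.

```python
import random


def simulate_drift(pop_size, start_freq=0.5, generations=20, rng=None):
    """Track the frequency of one trait under pure random resampling.

    Each generation, every one of the pop_size individuals independently
    inherits the trait with probability equal to its current frequency.
    There is no selection: the trait has no advantage or disadvantage.
    """
    rng = rng or random.Random()
    freq = start_freq
    history = [freq]
    for _ in range(generations):
        carriers = sum(1 for _ in range(pop_size) if rng.random() < freq)
        freq = carriers / pop_size
        history.append(freq)
    return history


def mean_abs_shift(pop_size, trials=100, generations=20, seed=0):
    """Average distance the trait frequency ends up from its 0.5 start."""
    rng = random.Random(seed)  # fixed seed (an arbitrary choice) for repeatability
    total = 0.0
    for _ in range(trials):
        final = simulate_drift(pop_size, 0.5, generations, rng)[-1]
        total += abs(final - 0.5)
    return total / trials


if __name__ == "__main__":
    for size in (10, 100, 1000):
        print(size, round(mean_abs_shift(size), 3))
```

Running it prints the average drift for three population sizes. The smaller the population, the further the frequency wanders from where it started, even though nothing about the trait's "fitness" differs.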

Again, my "meme theory" interest (obsession?) means I cannot help but notice 'one more case that fits'. Technology memes playing out their game of survival in the world....
The question is: How can I model, simulate, and more importantly - validate, prove....and Predict the future?

Thursday, May 15, 2008

Learning about the Social Networks behind Wiki

Reference: Korfiatis, Poulos and Bokos, “Evaluating authoritative sources using social networks: an insight from Wikipedia”, Online Information Review Vol. 30 No. 3, 2006 pp. 252-262

Two Layers of Network

(1) The articles network.
Every article in Wikipedia contains references to other articles as well as external references. A set of links used for classification purposes is also available in most of the active articles of the encyclopedia. Every article is a vertex in the article network, and the internal links between articles are the edges of the network.

(2) The contributors network.

Wikipedia is a collaborative writing effort, which means that an article has multiple contributors. We assume that a contributor establishes a relationship with another contributor if they work on the same article. In the resultant signed network, a vertex represents each contributor, and their social ties (positive or negative) are represented by an edge denoting the sequence of their social interaction.
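To make the two layers concrete for myself, here is a minimal sketch of how both networks could be built from plain dictionaries. The article titles, contributor names and function names are toy data I invented, not anything from the paper. The sketch also leaves out the positive/negative sign the authors put on contributor ties and just counts how many articles two contributors share.

```python
from collections import defaultdict
from itertools import combinations

# Toy data, invented purely for illustration.
article_links = {
    "Immanuel Kant": ["Philosophy", "Critique of Pure Reason"],
    "Critique of Pure Reason": ["Immanuel Kant"],
    "Philosophy": [],
}
article_contributors = {
    "Immanuel Kant": ["alice", "bob", "carol"],
    "Critique of Pure Reason": ["bob", "dave"],
    "Philosophy": ["alice"],
}


def article_edges(links):
    """Layer 1: a directed edge for every internal link between articles."""
    return {(src, dst) for src, targets in links.items() for dst in targets}


def contributor_edges(contributors):
    """Layer 2: an undirected edge between contributors who share an article.

    The weight is the number of articles the pair has in common.
    """
    weights = defaultdict(int)
    for editors in contributors.values():
        for a, b in combinations(sorted(set(editors)), 2):
            weights[(a, b)] += 1
    return dict(weights)


if __name__ == "__main__":
    print(sorted(article_edges(article_links)))
    print(contributor_edges(article_contributors))
```

In this toy example, "bob" is the only contributor who bridges the two Kant-related articles, which is the kind of structural position the contributors network is meant to expose.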

Visualization of the Social Network of contributors behind the article "Immanuel Kant"

...And of course, I had already mentioned Chris Harrison's WikiViz Project. Talk about beautiful graphs!

You can also have a look at the Clusterball Project. Don't miss the movie.

10 Questions, with Jimmy Wales

This might be an oldish article, but just thought this might be a good time to dig this up.

http://www.time.com/time/business/article/0,8599,1601491,00.html

Jimmy Wales' answer to "Why do people contribute" is especially interesting

"...It's realizing that doing intellectual things socially is a lot of fun—it makes sense. We don't plan on paying people, either, to contribute. People don't ask, "Gosh, why are all these people playing basketball for fun? Some people get paid a lot of money to do that."

He also says "It turns out that people aren't as horrible as the Internet made them seem for a while."

All the Interesting Questions about Wikipedia on One Page

Just quickly, off the top of my head, I can think of these

Growth
How fast will Wikipedia continue to grow in the near and far future?
Is there a limit to the growth (as per the logistic growth model)?
(...Or is "To know all is not permitted" !!?)
How will Quality of Content be affected in the near and far future, as Wikipedia grows?

Content
Is Wikipedia truly Reliable? Would you bet your life on a Fact from Wikipedia?
Can the regulatory mechanism be improved? How?
Do "Editing Wars" always lead to the "unbiased Truth"? (Can the Truth itself oscillate?!)
Does the process of "finding consensus" always lead to the best entry? (Is the "Average" always the right answer? Are there cases where the 'populist decision' may not be the 'right one' ?)

Motivation
What motivates Contributors? (The question "What motivates Seekers?" is fairly trivial)
What is the "carrot" at the end of the stick for contributors?
Will a contributor contribute content that has a high personal cost (or opportunity cost!) associated with sharing? (e.g., a stock trader chancing upon and then disclosing a piece of positive/negative news about a listed firm before it has broken on any other news channel, and without profiting personally, or an inventor publishing on Wikipedia without thought of personal gain from his/her invention)

Culture
(I use the term very loosely here, as the "sum total of all human social knowledge".)
Can one uncover hidden cultural facets by studying the topology of the Wikipedia network, by observing clusters, by deducing from "association" something of value? (somehow, again, I think - "Meme Theory"!)
What is the contribution of Wikipedia to Culture?

Wednesday, May 14, 2008

Computational Trust in Web Content Quality

Interesting points I found in Pierpaolo Dondio and Stephen Barrett, "Computational Trust in Web Content Quality: A Comparative Evaluation on the Wikipedia Project", Informatica 31 (2007) 151–160

Abstract
The problem of identifying useful and trustworthy information on the World Wide Web is becoming increasingly acute as new tools such as wikis and blogs simplify and democratize publication. It is not hard to predict that in the future the direct reliance on this material will expand and the problem of evaluating the trustworthiness of this kind of content become crucial. The Wikipedia project represents the most successful and discussed example of such online resources. In this paper we present a method to predict Wikipedia articles trustworthiness based on computational trust techniques and a deep domain-specific analysis. Our assumption is that a deeper understanding of what in general defines high-standard and expertise in domains related to Wikipedia – i.e. content quality in a collaborative environment – mapped onto Wikipedia elements would lead to a complete set of mechanisms to sustain trust in Wikipedia context. We present a series of experiment. The first is a study-case over a specific category of articles; the second is an evaluation over 8 000 articles representing 65% of the overall Wikipedia editing activity. We report encouraging results on the automated evaluation of Wikipedia content using our domain-specific expertise method. Finally, in order to appraise the value added by using domain-specific expertise, we compare our results with the ones obtained with a pre-processed cluster analysis, where complex expertise is mostly replaced by training and automatic classification of common features.

I thought this was interesting:
Ciolek, T., Today's WWW, Tomorrow's MMM: The specter of multi-media mediocrity, IEEE COMPUTER, Vol 29(1) pp. 106-108, January 1996.
Predicted a seriously negative future for online content quality by describing the World Wide Web (WWW) as “a nebulous, ever-changing multitude of computer sites that house continually changing chunks of multimedia information, the global sum of the uncoordinated activities of several hundreds of thousands of people”

.....On one hand, recent exceptional cases have brought to the attention the question of Wikipedia trustworthiness. In an article published on the 29th of November in USA Today, Seigenthaler, a former administrative assistant to Robert Kennedy, wrote about his anguish after learning about a false Wikipedia entry that listed him as having been briefly suspected of involvement in the assassinations of both John Kennedy and Robert Kennedy. The 78-year-old Seigenthaler got Wikipedia founder Jimmy Wales to delete the defamatory information in October. Unfortunately, that was four months after the original posting. The news was further proof that Wikipedia has no accountability and no place in the world of serious information gathering.

How much do you trust wikipedia? (March 2006)
http://news.com.com/20091025_3-5984535.html
In December 2005, a detailed analysis carried out by the magazine Nature compared the accuracy of Wikipedia against the Encyclopaedia Britannica. Nature identified a set of 42 articles, covering a broad range of scientific disciplines, and sent them to relevant experts for peer review. The results are encouraging: the investigation suggests that Britannica’s advantage may not be great, at least when it comes to science entries. The difference in accuracy was not particularly great: the average science entry in Wikipedia contained around four inaccuracies; Britannica, about three. Reviewers also found many factual errors, omissions or misleading statements: 162 and 123 in Wikipedia and Britannica respectively.
Giles, J., Internet encyclopaedias go head to head, Nature, Vol. 438, pp. 900-901, 15 December 2005
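A quick sanity check on the numbers quoted above: dividing the reported error totals by the 42 articles reviewed should reproduce the "around four" and "about three" per-entry averages. The variable names are mine. The only inputs are the figures given in the passage.

```python
# Figures as quoted in the passage above.
ARTICLES_REVIEWED = 42
ERRORS = {"Wikipedia": 162, "Britannica": 123}

# Average number of errors per reviewed article, rounded to 2 decimals.
per_article = {
    source: round(total / ARTICLES_REVIEWED, 2)
    for source, total in ERRORS.items()
}

# How many errors Wikipedia had for each Britannica error.
ratio = round(ERRORS["Wikipedia"] / ERRORS["Britannica"], 2)

if __name__ == "__main__":
    print(per_article)
    print("Wikipedia errors per Britannica error:", ratio)
```

So the averages are consistent with the totals, and on this sample Wikipedia had roughly a third more errors than Britannica.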

Trust
“trust is a subjective assessment of another’s influence in terms of the extent of one’s perceptions about the quality and significance of another’s impact over one’s outcomes in a given situation, such that one’s expectation of, openness to, and inclination toward such influence provide a sense of control over the potential outcomes of the situation.” - Romano
Computational trust was first defined by S. Marsh, as a new technique able to make agents less vulnerable in their behaviour in a computing world that appears to be malicious rather than cooperative, and thus to allow interaction and cooperation where previously there could be none.
Ziegler and Golbeck studied interesting correlation between similarity and trust among social network users: there is indication that similarity may be evidence of trust.
The most visited and edited articles reach an average editing rate of 50 modifications per day..."Speed" is one of the requirements that conventional techniques do not match up to.
In general, user past-experience with a Web site is only at 14th position among the criteria used to assess the quality of a Web site with an incidence of 4.6%. We conclude that a mechanism to evaluate articles trustworthiness relying exclusively on their present state is required.
Alexander identified three basic requirements: objectivity, completeness and pluralism. The first requirement guarantees that the information is unbiased, the second assesses that the information should not be incomplete, the third stresses the importance of avoiding situations in which information is restricted to a particular viewpoint.

Modeling Wikipedia

I won't elaborate on their experiment in detail, but jump straight to the conclusion.
They claim to have proposed a transparent, noninvasive and automatic method to evaluate the trustworthiness of Wikipedia articles. The method was able to estimate the trustworthiness of articles relying only on their present state, a characteristic needed in order to cope with the changing nature of Wikipedia.

Is Wikipedia Trustworthy?

Some people, especially academics, are uncomfortable with Wikipedia as a "source" of knowledge. Notwithstanding the regulatory and control mechanisms to prevent "vandalism" of content, there is still skepticism among most academics about how far Wikipedia can be a trustworthy resource.
In COMMUNICATIONS OF THE ACM September 2007/Vol. 50, No. 9, Neil L. Waters explains "Why You Can’t Cite Wikipedia in My Class"
A recent post on SlashDot quotes another IT professor saying:
"People are unwittingly trusting the information they find on Wikipedia, yet experience has shown it can be wrong, incomplete, biased, or misleading"

There was an interesting case recently of a "circular reference" created by Wikipedia. "Ali G", claimed a wikipedia entry, had worked for Goldman Sachs. No sources were given. This found its way into a popular mainstream media journal and Wikipedia became a reference to itself!
http://techdebug.com.nyud.net/blog/2008/04/19/wikipedia-article-creates-circular-references/

Where does that leave us?
I'll leave you with a quote from http://tech.slashdot.org/comments.pl?sid=521670&cid=23103370 (reference link above)
The real Wiki-vandals are the companies, governments and lobby groups of all sorts that flood Wikipedia with their squeaky clean corporate profiles (yes, corporate governments), whipped straight from their websites … These entities are the true threat to the laudable goal of Wikipedia to provide a freely accessible forum for the production and storage of (hopefully well-referenced) articles for the masses and a forum that does not restrict the privilege of contribution to those that have jumped through the all the right hoops. … The printed word is no more reliable than the plasma. Lies may be propagated on Wikipedia, but not without debate. Politicians spouting their sludge find their propaganda sitting side-by-side with those that mock them… If knowing that anything in a Wikipedia article is as likely to be crap as correct, the average reader becomes more vigilant in clicking through to the supporting sources; then Wikipedia has served the purpose of bringing to the masses the healthy skepticism that is, after all, the cornerstone of all academic pursuits. Dark eyes look down from ivory towers. Do they cheer or do they fear?

Visualizing Wikipedia

Chris Harrison, over at the Human-Computer Interaction Institute at Carnegie Mellon University, has some beautiful visualizations of Wikipedia's network structure. He calls this project WikiViz.
http://www.chrisharrison.net/projects/wikiviz/index.html
Apart from the stunning visuals, I am thankful to Chris for two ideas that came to me:
- Can Visualizations in the form of Graphs of Wikipedia be analyzed from the point of view of Meme Theory?
- Can I use GraphViz (or similar tools) to develop visualizations for concepts in Religious Texts?
Chris has, in another interesting project, undertaken a visualization of the social network present in the Bible (http://www.chrisharrison.net/projects/bibleviz/index.html).
Similar visualizations for the Gita and Quran should be possible. What would be the motive?
Some vague thoughts swimming in my mind at this stage: Memes, Clusters, Selective Pressures, Evolutionary theories of Culture.

Modeling Wikipedia's Growth

Wikipedia is one of the most interesting phenomena of current times.

One interesting area of study is the modeling of its growth.

If Wikipedia's growth follows the exponential growth model, the average rate of growth would be proportional to the size of Wikipedia. However, it appears that the rate of growth is slowing. Maybe Wikipedia's growth follows the logistic growth model better. This model is based on:
- more content leads to more traffic, which in turn leads to more new content
- however, more content also leads to less potential content, and hence less new content
- the limit is the combined expertise of the possible participants.
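The difference between the two models is easier to see side by side. Below is a small sketch using the textbook closed forms: exponential growth N(t) = N0 * e^(r*t), and logistic growth N(t) = K / (1 + ((K - N0) / N0) * e^(-r*t)), where K is the ceiling (here, the "combined expertise of the possible participants"). The starting size, growth rate and ceiling are placeholder values I picked for illustration. They are not fitted to any real Wikipedia data.

```python
import math


def exponential(t, n0, r):
    """Size at time t when growth is always proportional to current size."""
    return n0 * math.exp(r * t)


def logistic(t, n0, r, k):
    """Size at time t when growth also slows as the ceiling k is approached."""
    return k / (1 + ((k - n0) / n0) * math.exp(-r * t))


if __name__ == "__main__":
    # Hypothetical parameters, for illustration only.
    n0, r, k = 1_000, 0.5, 1_000_000
    for year in range(0, 31, 5):
        print(year, round(exponential(year, n0, r)), round(logistic(year, n0, r, k)))
```

Early on the two curves are nearly indistinguishable, which is why a slowing growth rate is the first hint that the logistic model may be the better fit. Later the logistic curve flattens out at K while the exponential one keeps climbing.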

Interestingly enough, while the number of articles may not be strictly following the exponential curve, we may consider that the quality of articles is, i.e., if we assume that the number of edits per article is a measure of its quality.

The graph is plotted in logarithmic scale, and this data also fits well with exponential growth starting from October 2002. The number of edits per article has since doubled once every 504 days.
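For a feel of what "doubling once every 504 days" means, the doubling time can be turned into a growth rate: under exponential growth the continuous daily rate is ln(2) / 504, and the growth factor over a year is 2^(365 / 504). A small sketch follows. The 504-day figure comes from the Wikipedia page referenced below. The function names and the 365-day year are my own choices.

```python
import math

DOUBLING_TIME_DAYS = 504  # figure quoted above


def daily_rate(doubling_days):
    """Continuous growth rate per day implied by a doubling time."""
    return math.log(2) / doubling_days


def growth_factor(days, doubling_days):
    """Multiplicative growth over a span of days, given the doubling time."""
    return 2 ** (days / doubling_days)


if __name__ == "__main__":
    print("daily rate:", daily_rate(DOUBLING_TIME_DAYS))
    print("one-year factor:", growth_factor(365, DOUBLING_TIME_DAYS))
```

In other words, edits per article grow by roughly two thirds each year at this pace.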

Reference: http://en.wikipedia.org/wiki/Wikipedia:Modelling_Wikipedia