Scraping Twitter using Outwit Hub

Students in my graduate unit Philosophies of Communication Technologies and Change (part of our Graduate Certificate in Social Media and Public Engagement) are producing simple lists of tweets.

Some students are using Outwit Hub to generate these lists as this is what I have used since 2012. I have created a guide “Scraping Twitter using Outwit Hub worksheet” for my students but others may also find it useful.

Scraping the results from a Twitter ‘advanced search’ allows you to create an archive of tweets without the limitations of the API. It is only useful for relatively small sets that have fewer than 3,200 tweets per day, as you can query Twitter for all tweets for a given hashtag per day.
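The day-by-day approach can be sketched in a few lines of Python. This is only an illustration, not part of the worksheet: the function name, the hashtag and the dates are made up, and it assumes the `since:` and `until:` search operators with `until:` treated as exclusive, so check how the search behaves before relying on it.

```python
from datetime import date, timedelta


def daily_queries(hashtag, start, end):
    """Return one search query string per day, from start to end inclusive."""
    queries = []
    day = start
    while day <= end:
        next_day = day + timedelta(days=1)
        # Assumption: since: is inclusive and until: is exclusive,
        # so each query covers exactly one calendar day.
        queries.append(
            f"#{hashtag} since:{day.isoformat()} until:{next_day.isoformat()}"
        )
        day = next_day
    return queries


# Hypothetical hashtag and date range, for illustration only.
for query in daily_queries("example", date(2015, 3, 1), date(2015, 3, 3)):
    print(query)
```

Each string can be pasted into the search box in turn and the results for that day scraped before moving on to the next.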

The lists of tweets will be used to carry out sophisticated analyses of the ‘circulation of discourse’:

Writing to a public helps to make a world, insofar as the object of address is brought into being partly by postulating and characterizing it. This performative ability depends, however, on that object’s being not entirely fictitious–not postulated merely, but recognized as a real path for the circulation of discourse. That path is then treated as a social entity. (Warner 2002: 64)

The character of this discourse will depend on the stakeholder publics that students (or their organisations) wish to engage with.


#thedress for journalism educators

Black and Blue? Gold and White? What does #thedress mean for journalism educators?

The Dress Buzzfeed
Original Buzzfeed post has now had 38 million views.

At the time of writing, the original Buzzfeed post has had just under 38m visitors and 3.4m people have voted in the poll at the bottom of the post. Slate created a landing page aggregating all their posts, including a live blog. Cosmo copied Buzzfeed. Time produced a quick post that included a cool little audio slideshow. Wired published a story on the science of why people see the wrong colours (white and gold). How can we use this in our teaching?

Nearly every single student in my big Introduction to Journalism lecture knew what I was talking about when I mentioned #thedress. I used it as a simple example to illustrate some core concepts for operating in a multi-platform or convergent news-based media environment.

Multi-Platform Media Event

Journalists used to be trained to develop professional expertise in one platform. Until very recently this included radio, television or print, and there was a period from the early to mid-2000s when ‘online’ existed as a fourth category. Now ‘digital’ modes of communication are shaping almost all others. We’ve moved from a ‘platform only’ approach to a ‘platform first’ approach (so that TV journalists also produce text or audio, writers produce visuals, and so on) and then to what is called a ‘multi-platform’ (or ‘digital first’, ‘convergent’ or ‘platform free’) approach.

When we think ‘multi-platform’, we think about how the elements of a story will be delivered across media channels or platforms:

  • Live – presentations
  • Social – Facebook, Twitter, Youtube, etc.
  • Web – own publishing platform, podcast, video, etc.
  • Mobile – specific app or a mobile-optimised website
  • Television – broadcast, narrowcast stream, etc.
  • Radio – broadcast, digital, etc.
  • Print – ‘publication’

‘Platform’ is the word we use to describe the social and technological relation between a producer and a consumer of a certain piece of media content in the act of transmission or access. In a pre-digital world, transmission or delivery were distinct from what was transmitted.

Thinking in terms of platforms also incorporates how we ‘operate’ or ‘engage’ with content via an ‘interface’ and so on. Most Australians get their daily news from the evening broadcast television news bulletin. Recent figures indicate, however, that most people aged 18-24 actually get their news about politics and elections from online and SNS sources rather than broadcast TV.

#thedress is a multi-platform media event. It began on Tumblr and then quickly spread via the Buzzfeed post to Twitter and across various websites belonging to news-based media enterprises. It only makes sense if the viral, mediated character of the event is taken into account. The #thedress media event did not simply propagate; it spread at different rates and in different ways. The amplification effect of celebrities meant #thedress propagated across networks that are different orders of magnitude in scale. Viral is a mode of distribution, but it also produces relations of visibility/exposure.

New News and Old News Conventions

Consumers of news on any platform expect the conventions of established news journalism. What are the conventions of established news journalism?

  • The inverted pyramid
  • The lead/angle
  • Sourcing/attribution
  • Grammar: Active Voice, Tense
  • Punctuation
  • Sentence structure
  • Word use
  • Fairness

When we look at the #thedress multi-platform media event we see that different media outlets covered the story in different ways. Time magazine wrote the most conventional lead out of any that I have seen; the media event is the story:

Everyone on the Internet Wants to Know What Color This Dress Is
The Internet took a weird turn Thursday when all of a sudden everyone started buzzing about the color of a dress. A woman had taken to Tumblr the day before to ask a seemingly normal question: what color is this dress?

Cosmopolitan largely mediated between the two, both framing the story as an investigation into colour, but also reporting on the virality of the multi-platform media event:

Help Solve the Internet’s Most Baffling Mystery: What Colors Are This Dress?
Blue and black? Or white and gold?
If you think you know what colors are in this dress, you are probably wrong. If you think you’re right, someone on the Internet is about to vehemently disagree with you, because no one can seem to agree on what colors these are.

I’ve only included the head, intro and first par for Time and Cosmo, and you can already see they are far more verbose than Buzzfeed’s original post. The original Buzzfeed post rearticulated a Tumblr post, but with one important variation:

What Colors Are This Dress?
There’s a lot of debate on Tumblr about this right now, and we need to settle it.
This is important because I think I’m going insane.
Tumblr user swiked uploaded this image.
There’s a lot of debate about the color of the dress.
So let’s settle this: what colors are this dress?
68% White and Gold
32% Blue and Black

The Buzzfeed post added an ‘action’: the poll at the bottom of the post. Why is this important?

Buzzfeed, Tumblr and the Relative Value of a Page View

Buzzfeed COO Jon Steinberg addressed the question of the Buzzfeed business model by posting a link to this article back in 2010:

Some of its sponsored “story unit” ad units have clickthrough rates as high as 4% to 5%, with an average around 1.5% to 2%, BuzzFeed President Jon Steinberg says. (That’s better than the roughly 1% clickthrough rate Steinberg says he thought was good for search ads when he worked at Google.) BuzzFeed’s smaller, thumbnail ad units have clickthrough rates around 0.25%.

The main difference now is the importance of mobile. In a 2013 post to LinkedIn Steinberg wrote:

At BuzzFeed our mobile traffic has grown from 20% of monthly unique visitors to 40% in under a year. I see no reason why this won’t go to 70% or even 80% in couple years.

Importantly, Buzzfeed’s business model is still organised around displaying what used to be called ‘custom content’ and what is now commonly referred to as ‘native advertising’ or even ‘content marketing’ when it is a longer piece (like these Westpac sponsored posts at Junkee).

Image via Jon Steinberg, LinkedIn

On the other hand, Tumblr is a visual platform; users are encouraged to post, favourite and reblog all kinds of content, but mostly images. For example, .gif-based pop-culture subcultures thrive on Tumblr, and Tumblr icons are those that perform gestures that are easily turned into gifs (Taylor Swift) or static images (#thedress). The new owners of Tumblr, Yahoo, are struggling to commercialise Tumblr’s booming popularity.

I had a discussion with Matt Liddy and Rosanna Ryan on Twitter this morning about the relative value of the 73 million views of the original Tumblr post versus the value of the 38 million views of the Buzzfeed post. Trying to make sense of what is of value in all this is tricky. At first glance the 73 million views of the original Tumblr post trumps the almost 38 million views of the Buzzfeed post, but how has Tumblr commercialised the relationship between users of the site and content? There is no clear commercialised relationship.

Buzzfeed’s business model is premised on a high click-through rate for their ‘native advertising’. Of key importance in all this is the often overlooked poll at the bottom of the Buzzfeed post. Almost 38 million or even 73 million views pale in significance next to the 3.4 million votes in the poll. Around 8.6% of the millions of people who visited the Buzzfeed article performed an action when they got there. This may not seem as impressive an action as that of the 483.2 thousand Tumblr users who reblogged the #thedress post, but the difference is that Buzzfeed has a business model that has commercialised performing an action (click-through), while Tumblr has not.
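The arithmetic behind this comparison is simple enough to check. The sketch below uses only the rounded figures quoted in this post; the function name is my own. With these rounded numbers the poll rate comes out at about 8.9%, close to the 8.6% above, which presumably reflects the unrounded counts at the time.

```python
def action_rate(actions, views):
    """Share of views that led to an action, as a percentage."""
    return 100 * actions / views


# Rounded figures as quoted in the post.
buzzfeed_votes = 3_400_000
buzzfeed_views = 38_000_000
tumblr_reblogs = 483_200
tumblr_views = 73_000_000

print(f"Buzzfeed poll votes per view: {action_rate(buzzfeed_votes, buzzfeed_views):.1f}%")  # 8.9%
print(f"Tumblr reblogs per view: {action_rate(tumblr_reblogs, tumblr_views):.2f}%")  # 0.66%
```

On these figures a visitor to the Buzzfeed post was more than ten times as likely to perform an action as a viewer of the Tumblr post.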

Nieman Lab 2015 Predictions for Journalism

Last week I delivered the first lecture in our Introduction to Journalism unit. I am building on the material that my colleague, Caroline Fisher, developed in 2014. One of the things about teaching journalism is that every example has to be ‘up to date’. One of the things that Caroline discussed in the 2014 lecture was the set of predictions for 2014 as presented by the Nieman Lab.

The Nieman Lab is a kind of journalism think tank, clearing house and site of experimentation. At the end of each year they ask professionals and journalism experts to suggest what they think is going to happen in journalism the next year.

Incorporating these predictions into a lecture is a good way to indicate to students what some professionals and experts think are going to be the big trends, changes and events in journalism for that year. (The anticipatory logic of predictions about near-future events has become a genre of journalism/media content that I briefly discuss in a forthcoming journal article. See what I did there.)

To analyse the 65 predictions for 2015 in a lecture that only goes for an hour would be almost impossible. What I did instead was to carry out a little exercise in data journalism to introduce students to the practical concepts of ‘analytics’, ‘website scraping’, and the capacity to ‘tell a story through data’.

Nieman Lab
Nieman Lab 2015 Predictions

I created a spreadsheet using Outwit Hub Pro that scraped the author’s name, the title of the piece, the brief one or two line intro and the number of Twitter and Facebook shares. I wanted to know how many times each prediction had been shared on social media. This could then serve as a possible indicator of whether readers thought the prediction was worth sharing through at least one or two of their social media networks. By combining the number of shares I could then have a very approximate way to measure which predictions readers of the site valued most.
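For anyone who would rather do the combining step in code than in a spreadsheet, here is a minimal Python sketch. It is not what I used: the class and function names are hypothetical, and the three rows are invented placeholders rather than scraped data.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    author: str
    title: str
    twitter_shares: int
    facebook_shares: int

    @property
    def combined_shares(self):
        """Twitter and Facebook shares added together."""
        return self.twitter_shares + self.facebook_shares


def rank_by_shares(predictions):
    """Return predictions sorted from most to least combined shares."""
    return sorted(predictions, key=lambda p: p.combined_shares, reverse=True)


# Placeholder rows, invented for illustration only (not the scraped data).
rows = [
    Prediction("Author A", "Prediction A", 120, 45),
    Prediction("Author B", "Prediction B", 300, 210),
    Prediction("Author C", "Prediction C", 80, 15),
]

for p in rank_by_shares(rows):
    print(p.combined_shares, p.title)
```

Adding the two counts treats a Twitter share and a Facebook share as equivalent, which is a crude assumption but good enough for a very approximate ranking.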

Spreadsheet shares
Here is the spreadsheet created through Outwit Hub Pro.

I have uploaded the table of the Nieman Lab Journalism Predictions 2015 to Google Drive. The table has some very quick and simple coding of each of the predictions so as to capture some sense of what area of journalism the prediction is discussing.

The graph resulting from this table indicates that there were four predictions that were shared more than twice the number of times compared to the other 61 predictions. The top three stories had almost three times the number of shares.

combined social shares
The four predictions with the highest number of shares clearly stand out from the rest.

Here are the four stories with the total number of combined shares:

  1. Diversity: Don’t talk about it, be about it – 1652
  2. The beginning of the end of Facebook’s traffic engine – 1617
  3. The year we get creeped out by algorithms – 1529
  4. A wave of P.R. data – 1339

I was able to then present these four links to my students and suggest that it was worth investigating why these four predictions were shared so many more times than the other 61 predictions.

In the most shared prediction, Aaron Edwards forgoes the tech-based predictions that largely shape the other pieces and instead argues that media organizations need to take diversity seriously:

I guess I could pivot here to talk about the future of news in 2015 being about mobile and personalization. (I would geek out about both immensely.) I suppose I could opine on how the reinvention of the article structure to better accommodate complex stories like Ferguson will be on every smart media manager’s mind, just as it should have been in 2014, 2013, and 2003.
But let’s have a different kind of real talk, shall we?
My prediction for the future of news in 2015 is less of a prediction and more of a call of necessity. Next year, if organizations don’t start taking diversity of race, gender, background, and thought in newsrooms seriously, our industry once again will further alienate entire populations of people that aren’t white. And this time, the damage will be worse than ever.

It was a different kind of prediction compared to the others on offer. Most people who work in the news-based media industry have been tasked with demonstrating a permanent process of professional innovation. Edwards’ piece strips back the tech-based rhetoric and gets at the heart of what media organizations need to be doing so as to properly address all audiences. “The excuse that it’s ‘too hard’ to find good journalists of diverse backgrounds is complete crap.”

The second most shared piece, on the limitations of over-relying on Facebook as a driver of traffic, fits perfectly with the kind of near-future prediction that we have come to expect. Gnomic industry forecasting flips the causal model with which we are familiar, in which we are driven by ‘history’ and it is the ‘past’ (past traumas, past successes, etc.) that defines our current character, so that it draws on the future as a kind of tech-mediated collective subconscious. Rather than being haunted by the past, we are haunted by possible futures of technological and organisational change.

My favourite piece among all the predictions is by Zeynep Tufekci, who suggests that things are going to get weird when our devices start to operate as if animated by a human intelligence. She suggests that “algorithmic judgment is the uncanny valley of computing”:

Algorithms are increasingly being deployed to make decisions where there is no right answer, only a judgment call. Google says it’s showing us the most relevant results, and Facebook aims to show us what’s most important. But what’s relevant? What’s important? Unlike other forms of automation or algorithms where there’s a definable right answer, we’re seeing the birth of a new era, the era of judging machines: machines that calculate not just how to quickly sort a database, or perform a mathematical calculation, but to decide what is “best,” “relevant,” “appropriate,” or “harmful.”

ebooks: or the

Appending ebooks to something is a practice belonging to subcultures on twitter and derived from the meme surrounding the horse_ebooks twitter account. Here are some notes on the cultural meaning of ‘_ebooks’.


[] [] []


Various ‘_ebooks’ accounts have been created. What they all have in common is the algorithmic act of sampling source material and turning it into a tweet. On the process behind horse_ebooks:

The algorithm that produces the horse_ebooks stream, like most spammic algorithms, relies on user interaction to grow more effectual. It interprets text as data, and determines which keywords might best promote an outcome like the sale of Cialis or Horse Medical Records. Just as with many of our more popular and less insidious internet applications, the more interaction the algorithm gets, the smarter it becomes. The growing popularity of horse_ebooks has reciprocally allowed it to become better at generating tweets like “The Fear Of lowlife criminals With Environmental Protection” (October 30, 2011; 49 retweets).


While it is true that the algorithm publishes tweets that were, at some point, somehow, written, it is non-author in the sense that it defies the binary border between ebook and reader. It imagines non-authorship in a way that even social media, with its dissolution of anonymity, can enable.

Sure. There is something else going on when _ebooks is appended to non-_ebooks; that is, something that is ostensibly not the algorithmic poetics actioned in the event-space between discourse and code. Something has been extracted from the _ebooks phenomenon and has now been folded back into the social practices on social media.

[] [] []

What has been extracted? (Or what makes ‘_ebooks’ singular?) It is something that, firstly, plays with the relation between sense and nonsense. horse_ebooks enthusiasts are sometimes criticised for anthropomorphising the algorithm-based expressions. The non-discursive semantic sampling of source material is an algorithmic variation of the creative/aesthetic practice of producing and exploding disjunction. Contra the critics, this does not foreclose the possibility of meaning; it means only that the discursive dimension of the sample has also been parsed by the algorithm. What is this discursive dimension? The incorporeal materialism of all language. [1. See Foucault’s “Discourse on Language” on ‘incorporeal materialism’: “If discourses are to be treated first as discursive events, what status does this notion of event have? Of course, an event is neither substance, nor accident, nor quality nor process; events are not corporeal. And yet, an event is certainly not immaterial; it takes effect, becomes effect, on the level of materiality. Events consist in relation to, coexistence with, dispersion of, the cross-checking accumulation and the selection of material elements; it occurs as an effect of, and in, material dispersion.”]

So a little nugget of sense emerging from non-sense. From the perspective of information theory, this is clearly irrational, because signals do not simply emerge from what is ostensibly noise; noise impinges on signals, and so on. The meaning produced by _ebooks twitter accounts is (unintentionally?) meaningful but in a quasi-random manner. ‘Random’ because it is derived from the sampling algorithm, ‘quasi’ because it relies on coded text that otherwise belongs to the logical systems of language. That is, the _ebooks are never quite ‘noise’ because they are, at a minimum, sensible as nonsense.
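To make ‘quasi-random’ concrete, here is a toy sketch in Python. It is not the horse_ebooks algorithm, which I have not seen; I have swapped in a simple word-level Markov chain as a stand-in, and the source sentence and function names are made up for illustration. The sampling is random, but every step is constrained by word sequences that already existed in the source text.

```python
import random
from collections import defaultdict


def build_chain(text):
    """Map each word to the list of words that follow it in the source."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain


def sample_fragment(chain, length, rng):
    """Walk the chain from a random start word for up to `length` words."""
    word = rng.choice(sorted(chain))
    fragment = [word]
    while len(fragment) < length:
        followers = chain.get(word)
        if not followers:
            # Dead end: this word was never followed by anything in the source.
            break
        word = rng.choice(followers)
        fragment.append(word)
    return " ".join(fragment)


# Made-up source text, standing in for whatever material an account samples.
source = "the fear of the horse is the fear of the record and the record of the horse"

chain = build_chain(source)
print(sample_fragment(chain, 8, random.Random(0)))
```

Every adjacent pair of words in the output appeared somewhere in the source, which is why the result reads as sensible nonsense rather than as noise.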

The _ebooks tweets exist not just as a ‘text’, however. They are better understood as an event. Techniques and technologies of representation (language, media, etc.), like all kinds of communication, are forms of transport. [2. Raymond Williams was very clear about this in his ‘Communication’ entry in the iconic Keywords — where it can mean ‘transmit’ or ‘share’.] Representation brings a time and place into contact with another time and place. The singular quality of this contact is the event of sense. Practices on twitter materially enact this process of representational transport. Retweets are ambiguous, ‘favourites’ are less so.

The practical dimension of ‘retweets’ and ‘favouriting’ modifies the relations of visibility and the relations of valorisation inherent in all acts of communication. The normative content of a tweet does not have to be the content that is valorised; rather, more sophisticated twitter users often retweet in an ironic fashion. Twitter users can choose to participate in these processes of transportation by retweeting, this is obvious; less obvious is the purpose of retweeting ostensibly nonsensical tweets. In the passage of the retweet (the ‘journey’ of the communicational transport), what is gained or lost?


[] [] []


One of Tim Lampe’s Horse E-Posters:

For a long time subcultural groups have created entire languages of meaning that appear to be nonsensical to outsiders. This is in part happening here as ‘_ebooks’ is a syntactic morpheme belonging to denotational practices of twitter-based subcultures. Retweeting can be understood as a practice of citation; think of that bloke everyone went to school with who knew every single line from the Simpsons. Citing the Simpsons produced a measure of cultural cachet as a performance of cultural taste.

Retweeting does something similar, but with an additional dynamic dimension. The political economy of belonging in online networks not only means ‘following’ the right ‘people’ (or emitters of becoming-sensical content), but also of participating in the passage of meaning as meaning itself is enacted. Not only is the content shared, as per Raymond Williams’ definition of communication, but the becoming-sensical of the content is also shared.

Think of a joke that develops over the course of an evening. The release of tension signalled by the smile (weak) or laughter (strong) is triggered by a disjunction that produces the affective tension present in all humour. [3. “A horse_ebooks walks into a bar.” “The barmen says, “.] Such jokes cascade, but they are also repeated other nights, just as the possibility of such a joke developing is repeated. The algorithmic disjunction of sample text of the original ‘_ebooks’ twitter accounts is pregnant with a similar potentiality.


[] [] []


What does ‘_ebooks’ represent?

What happens when ‘_ebooks’ is appended to something?

It is an ironic signifier. Instead of signifying the becoming-sensical of the original algorithmic ‘_ebooks’ twitter accounts, it is signifying the (alleged) becoming-nonsensical of an actual person’s expression.



Frank and Robot


Go and see it.

Great track. “Fell On Your Head” by Francis and the Lights.

The point isn’t whether or not he was going to kill himself; it was that he had a moment of lucidity, and he wanted to share that with his kids. When he was debating whether or not to erase the robot’s memory, it was a realisation that if he did, he was being deleted: a metaphor for his state of being. He wasn’t living if he couldn’t remember.