Assistant Professor Matthew Weber and Colleagues Host Datathon at the Library of Congress

January 11, 2017

Weber’s work in this area has put SC&I at the forefront of a critically important international research endeavor.


Here’s a mind twister: Libraries house, in the form of web archives, vast collections of incredibly rich and valuable research material. However, many of these web archives are not currently accessible, and often librarians are collecting them with the simple hope that they’ll be used in the future. Imagine the impact on global scholarship if all of this data were readily available to researchers and scholars.

Striving to come up with a way to make this happen are SC&I’s Assistant Professor Matthew Weber and two of his colleagues, Jimmy Lin (Computer Science, University of Waterloo) and Ian Milligan (History, University of Waterloo). Part of their solution to this problem: hosting cutting-edge events called datathons. Weber’s work in this area has put SC&I at the forefront of a critically important international research endeavor.

Weber, Lin and Milligan recently hosted their second datathon (essentially a hackathon) at the Library of Congress in Washington, D.C. Held on June 14 and 15, 2016, and titled “Archives Unleashed 2.0,” this datathon enabled participants to get “involved with web collections. Libraries are creating huge research repositories - but scholars aren’t using the data yet. This datathon enabled us to push the limits of large-scale data analysis and high performance computing,” Weber explained. “It’s amazing what we’re able to accomplish in a two-day period.”

The first datathon, “Archives Unleashed 1.0,” was held at the University of Toronto and hosted by SC&I, the University of Waterloo, the University of Toronto, and Université du Québec. The same group came together again to host the second event. Both datathons were supported with funding from the National Science Foundation (NSF) and the Social Sciences and Humanities Research Council of Canada.

“This is a growing community,” Weber said. “In the last two years, the community of scholars working with large-scale archived data has grown into the hundreds. I received an NSF grant four years ago to work on building this community - and it’s grown significantly in recent years. The datathons are - in part - the outcome of the work my team has been leading to build this community.”

The second datathon was held at the Library of Congress because, as Weber said, “It’s the Library of Congress! How inspiring is that? We couldn’t think of a better place to showcase the intersection of digital archives, new technology, and data infrastructure challenges than the Library of Congress!” In addition, Weber explained, “The Library of Congress wanted to see what could be done with their data. They are interested in getting researchers involved with web archives - and this was a great way to accomplish that goal.” For example, Weber pointed out that the digital archives of Supreme Court data explored at this datathon had never been released before.

Weber and his colleagues issued a call for applications for the datathon. Researchers applied to participate, and a committee selected the participants and awarded them travel funding. Then, as Weber said, the organizers “let them loose to explore the web collections and test cutting-edge technologies that could make the data accessible (such as Warcbase - a special package for analyzing archived web data).”

The 75 researchers at the Library of Congress datathon were accepted after a rigorous application process: CFPs were sent out to a number of lists, and students applied. Among them were two students from SC&I, Teis Kristensen and Allie Kosterich; Kristensen had also attended the first event.

The datathon was followed by a symposium on “Saving the Web,” where the keynote was given by Vint Cerf (Chief Internet Evangelist at Google and co-inventor of the TCP/IP protocols, “the backbone of the internet”). The winning presentations from the datathon were also given at the symposium.

Explaining the origin of the idea to host datathons, Weber said, “The idea is something we collectively came up with a year ago during a conference we attended at Columbia University. We had been having a discussion about the challenges associated with working with large-scale data - especially archived web data - and training graduate students to work in this space. We thought a datathon would be a great way to get students involved.”

Back at SC&I, Weber is planning to create a new course for SC&I’s Master of Communication and Information Studies (MCIS) and Master of Information (MI) programs, which, as he says, will be “in the area of digital web archiving - this would be a direct translation of lessons learned from datathon.”

Looking ahead, Weber, Lin and Milligan have already begun planning future events. “We’re planning Archives Unleashed 3.0 at Stanford and the Internet Archive in February 2017 and Archives Unleashed 4.0 at the British Library in June,” Weber said.