Big Data in Biology: A Case Study in Computational Proteomics with Python and MongoDB

03:30 PM - 04:00 PM on August 16, 2014, Room 704

Himanshu Grover

Audience level:


Biomedical science is increasingly becoming a quantitative discipline. State-of-the-art technologies can now provide a detailed molecular-level snapshot of an individual, revolutionizing our understanding of disease and fundamental biology. Mass spectrometry is one such enabling technology: it profiles (identifies and quantifies) proteins and other biomolecules present in samples such as blood, saliva and tissue. Making sense of the complex data from a mass spectrometer requires sophisticated informatics algorithms. At other times, such research relies heavily on exploratory analyses and visualizations to generate new hypotheses for further investigation. This talk will focus on our efforts to build a scalable framework in Python for large-scale mining of mass-spectrometry datasets. We exploit modern “Big Data” technologies in conjunction with Python’s mature data analytics libraries to harness these data in novel ways. In particular, MongoDB, a document-oriented database, will be discussed in the context of our informatics applications.


Mass spectrometers generate massive amounts of complex proteomic data in a high-throughput manner, challenging traditional analytics systems and workflows. A moderately sized lab can easily generate several GB of raw data every day. Using these data archives effectively to derive context and biological insights depends on the ability to efficiently persist, query and analyze the data with computational and statistical methods. We use MongoDB, a fast and scalable document-oriented storage platform, to address our data persistence challenges. Through the PyMongo driver, this powerful technology is easily integrated into Python’s already rich data analytics ecosystem.
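As a rough illustration of the PyMongo integration described above, the sketch below builds a range query over precursor m/z and runs it against a collection. The database, collection and field names (`proteomics`, `spectra`, `precursor_mz`, `scan_id`, `charge`) and the helper functions are hypothetical and chosen only for this example; `MongoClient` and `Collection.find` are the standard PyMongo calls.

```python
def get_collection(uri="mongodb://localhost:27017",
                   db_name="proteomics", coll_name="spectra"):
    """Return a PyMongo collection handle (all names here are hypothetical)."""
    # Imported lazily so the query helper below works without PyMongo installed.
    from pymongo import MongoClient
    return MongoClient(uri)[db_name][coll_name]


def precursor_window_query(target_mz, tolerance):
    """Build a filter for spectra whose precursor m/z lies within a window."""
    return {"precursor_mz": {"$gte": target_mz - tolerance,
                             "$lte": target_mz + tolerance}}


def find_spectra(collection, target_mz, tolerance=0.5):
    """Run the window query and return scan ids and charges of the matches."""
    # The projection keeps only the two fields of interest and drops _id.
    projection = {"_id": 0, "scan_id": 1, "charge": 1}
    query = precursor_window_query(target_mz, tolerance)
    return list(collection.find(query, projection))
```

With a MongoDB server running locally, `find_spectra(get_collection(), 500.0)` would return the matching documents as plain Python dicts. An index created with `collection.create_index("precursor_mz")` would likely help keep such range queries efficient on large collections.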

MongoDB provides a flexible data model in the form of a set of key-value pairs (called a document), where values can be simple or complex (embedded documents or arrays). This allows storage of the rich, hierarchical and non-uniform representations that we encounter in many scientific domains, including mass spectrometry. This simple and intuitive data model maps directly to Python’s built-in dict data structure, which eases application development.
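To make the mapping between documents and dicts concrete, here is a minimal sketch of how a single spectrum might be represented. The schema (field names, the `peaks` array, the embedded `sample` document) and the sample values are assumptions made for illustration, not the schema used in our framework.

```python
import json


def make_spectrum_document(scan_id, precursor_mz, charge, peaks, sample):
    """Assemble one spectrum as a nested dict (hypothetical schema)."""
    return {
        "scan_id": scan_id,
        "precursor_mz": precursor_mz,
        "charge": charge,
        # Array of embedded documents, one per fragment peak.
        "peaks": [{"mz": mz, "intensity": intensity} for mz, intensity in peaks],
        # Embedded document; its keys may differ from one spectrum to the next.
        "sample": sample,
    }


doc = make_spectrum_document(
    scan_id=1042,
    precursor_mz=523.28,
    charge=2,
    peaks=[(175.12, 3400.0), (262.15, 1280.5)],
    sample={"tissue": "plasma", "replicate": 1},
)

# The document is an ordinary dict, so standard tools such as json work on it.
print(json.dumps(doc, indent=2))
```

A dict of this shape can be passed as-is to PyMongo's `Collection.insert_one`, and documents with different or additional keys can live in the same collection.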

In addition to a full-featured query engine that supports efficient, ad hoc querying over large and distributed datasets, MongoDB offers powerful analytics tools (the ‘aggregation framework’ and ‘MapReduce’) that informatics and data mining applications can use for simple real-time operations or complex batch operations, distributed across a cluster. Finally, MongoDB can also act as a data source and/or target for more advanced distributed processing on the Hadoop platform. Using this framework, several informatics tasks in mass spectrometry-based proteomics are easily streamlined and parallelized to handle large data workloads.
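The aggregation framework mentioned above can be sketched as a pipeline of stages expressed as Python dicts. The example below counts identifications per protein and averages their scores; the field names (`protein`, `charge`, `score`) and both function names are hypothetical. The in-memory function is included only so the intended result can be checked without a server; it is meant to mirror the pipeline for well-formed documents, not to reproduce MongoDB's semantics in every edge case.

```python
from collections import defaultdict

# Stages: filter by charge, group by protein, then order the summary rows.
PIPELINE = [
    {"$match": {"charge": {"$gte": 2}}},
    {"$group": {"_id": "$protein",
                "count": {"$sum": 1},
                "mean_score": {"$avg": "$score"}}},
    {"$sort": {"count": -1, "_id": 1}},
]


def summarize_with_mongodb(collection):
    """Run the pipeline server-side on a PyMongo collection."""
    return list(collection.aggregate(PIPELINE))


def summarize_in_memory(docs):
    """Pure-Python counterpart of PIPELINE for a list of dicts."""
    scores_by_protein = defaultdict(list)
    for d in docs:
        if d["charge"] >= 2:                      # $match
            scores_by_protein[d["protein"]].append(d["score"])
    rows = [{"_id": protein,                      # $group
             "count": len(scores),
             "mean_score": sum(scores) / len(scores)}
            for protein, scores in scores_by_protein.items()]
    rows.sort(key=lambda r: (-r["count"], r["_id"]))  # $sort
    return rows
```

Because the pipeline is plain data, it can be built and modified programmatically before being handed to `Collection.aggregate`, which is convenient for interactive, exploratory analysis.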

Overall, MongoDB forms a central piece of our computational infrastructure, standardizing and streamlining data access and analytics operations. We will present and discuss several examples demonstrating how we leverage these technologies within our Python codebase for enhanced flexibility, scalability, agility and interactivity, thus simplifying and accelerating the development of novel, data-driven proteomics applications.