Joyce Peng

Beijing Genomics Institute Americas Corporation

How many of you know about BGI? We are the largest genomics institute in the world, with 4000 employees in Shenzhen, China. We are a non-profit organization, and we do commercial services as well, basically to have funds that feed the research. All the proceeds from the services are used to fund research and to benefit mankind. BGI Americas is based in the U.S. and headquartered in Boston. It is the interface with customers and collaborators in the Americas. Customers include 15 of the top 20 pharmas in the world.

BGI has an incredible wealth of resources. We have over 160 next-gen sequencers, which at one point was more than the entire capacity of the US. We can do 2k human genomes per day, or about 6 terabytes per day. We have multiple supercomputing centers to manage this data. We do our solutions and research in transcriptomics, metagenomics, epigenomics and a lot of different sequencing approaches, as well as bioinformatics. These are journal covers from our collaborative projects. We are core participants in all of these international projects.

Between the internet and next-generation sequencing, a lot of data is being generated. This creates the challenge of being able to handle and share this data. I want to quickly talk about what we did; free public sharing has always been an important principle at BGI. We sequenced the rice genome in 2002, and in 2005 we followed it up. We made the decision to make it publicly available, because this can affect the field of rice research. This graph shows that before the publication, wheat had more papers. After our publication, rice rose and took off. This made a huge difference.

This is a study that compared papers on clinical trial data. This group is the papers that came with the microarray data; these did not. There is a significant difference in the citation rate: papers that came with the microarray data are cited more.
With those thoughts in mind, we launched a new journal called GigaScience. It is trying to address a problem: we have so many collaborations internationally, and there is no publication or home for the data generated. GigaScience will focus on large-scale data, with both the publication and the database. Both of them will be published, and the journal will host the data and find ways to share it publicly.

Today I want to quickly talk about a recent story that we were deeply involved in, the E. coli outbreak in Germany. This was May-July 2011. There were over 4k cases and 40 deaths. For people who got infected, there was a high rate of kidney failure, and somehow females were particularly at risk. Cucumbers were initially blamed, and later it was found that sprouts were the source. Because of this, European farmers lost $600 million, which was a big impact of the wrong identification.

What BGI did: when the Germans first obtained the sample, they sought us out and immediately sent it to us. We did the sequencing, made a map of it and completed it in 3 days. After that, we did third-generation sequencing using an Ion Torrent Personal Genome Machine. We did second-generation sequencing on an Illumina HiSeq. And then we did first-generation standard sequencing for validation. If you are interested in next-gen sequencing: 3rd-gen is fast, 2nd-gen is thorough, and Sanger (1st-gen) is good for validation. They all have pros and cons and are used for different purposes.

We received DNA from a 16-year-old patient. We used the Ion Torrent to sequence it first. The major finding was that we defined the genome, typed it, and were able to trace its origin. It is similar to the strain from an HIV patient from Central Africa in 2001. Then we did an analysis and found 8 antibiotic resistance genes. The second step was the Illumina machine, for a more complete map.
We were able to classify it and to see what caused the problem as well. This is the genome, the TY-2482 genome. This is the larger chromosome.

Within five days we designed a diagnostic protocol based on PCR. We used two sets of primers that let us get a result, and see whether a sample carries the strain or is contaminated, within two to three hours. We tested the specificity of the protocol by looking at 4547 strains. We also did experimental analysis, and we found that it was highly specific. The protocol was sensitive down to 1 picogram of DNA. We made the primers available to the public, so they were used in a lot of tests in hospitals. Here's a link for the protocol.

ftp://ftp.bj... something something.. dot pdf

The whole process had a lot of help from people around the world. We made the sequence available through many different means, like Twitter and other sharing places. This is an example of how it was done.

E. coli O104:H4 genome analysis crowdsourcing (github) https://github.com/ehec-outbreak-crowdsourced/BGI-data-analysis (this isn't the same thing she showed, though?)
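
The talk does not say how the specificity of the two primer sets was checked across strains. As a rough illustration only, one way to screen sequences computationally is an in-silico PCR: a sample counts as positive only if both primer pairs would produce an amplicon. Everything in this sketch is an assumption made for illustration: the primer sequences and templates are made-up toy strings (not BGI's published primers), the function names are hypothetical, and matching is exact and forward-strand only, whereas real primers tolerate some mismatches.

```python
# Toy in-silico PCR check with two primer pairs.
# All sequences below are invented for illustration; they are NOT the
# primers from the BGI protocol.

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def amplicon_length(template: str, fwd: str, rev: str, max_len: int = 2000):
    """Return the amplicon length if both primers match exactly, else None.

    The forward primer must occur on the given strand, and the reverse
    primer's binding site (its reverse complement) must occur downstream.
    """
    template = template.upper()
    start = template.find(fwd.upper())
    if start == -1:
        return None
    rev_site = reverse_complement(rev)
    end = template.find(rev_site, start + len(fwd))
    if end == -1:
        return None
    length = end + len(rev_site) - start
    return length if length <= max_len else None


def is_positive(template: str, primer_pairs) -> bool:
    """A sample is called positive only if every primer pair amplifies."""
    return all(amplicon_length(template, f, r) is not None for f, r in primer_pairs)


# Hypothetical primer pairs (forward, reverse).
PAIR_1 = ("ATGGCGTACG", "TTAGCCGGAT")
PAIR_2 = ("GGCATTCAGC", "CCTAGGTTAA")
PRIMER_PAIRS = [PAIR_1, PAIR_2]

# Toy template carrying both target regions.
target_like = (
    "CC" + PAIR_1[0] + "ACGT" * 5 + reverse_complement(PAIR_1[1])
    + "TT" + PAIR_2[0] + "TGCA" * 5 + reverse_complement(PAIR_2[1]) + "GG"
)
# Toy template carrying only the first target region.
other_strain = "CC" + PAIR_1[0] + "ACGT" * 5 + reverse_complement(PAIR_1[1]) + "TTTTTTTT"

if __name__ == "__main__":
    print("target_like positive:", is_positive(target_like, PRIMER_PAIRS))
    print("other_strain positive:", is_positive(other_strain, PRIMER_PAIRS))
```

Requiring both pairs to amplify is what makes the toy call specific: a strain that shares only one of the two target regions is reported as negative.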