Microsoft Research (MSR) is the research subsidiary of Microsoft, formed in 1991 with the intent to advance state-of-the-art computing and solve difficult world problems through technological innovation in collaboration with academic, government, and industry researchers. The Microsoft Research team employs more than 1,000 computer scientists, physicists, engineers, and mathematicians, including Turing Award winners, Fields Medal winners, MacArthur Fellows, and Dijkstra Prize winners.
MSR has numerous scientists and partners working on tasks that involve very large datasets. Some are as small as 100 MB, while others exceed 50 TB. The datasets are also scattered across divisions, which makes them difficult to discover and use.
To address these issues, Microsoft Research approached Wintellect to create the Microsoft Research Open Data portal web site. This site allows users to discover and select interesting data and then quickly provision it to an Azure cloud environment for further analysis. Longer term, the vision is to open the portal to contributions and consumption from outside of Microsoft Research, with the primary target audiences being data scientists, researchers, and colleges and universities. You can visit the site here: https://msropendata.com/.
The Microsoft Research Open Data portal design makes it easy to identify datasets by specific criteria, or to search for datasets based on descriptions and characteristics. Once datasets are found, individual files can be previewed in a browser and downloaded directly to the user’s computer. A user can also get a link that allows the data to be consumed directly from Azure Blob Storage, making it easy to use any preferred toolset. Longer-term plans include the possibility of provisioning and deploying data directly to virtual machines. To download a file or consume the dataset in Azure, the user must first accept the associated license agreement. The user’s acceptance of the license is then tracked for compliance.
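The license gate described above can be illustrated with a minimal sketch. It is not the portal's actual implementation: the `LicenseLedger` class, the `get_download_link` function, and the placeholder URL are hypothetical names introduced here, and the acceptance record lives in memory rather than in a real data store.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class LicenseLedger:
    """Hypothetical in-memory record of license acceptances, kept for compliance."""

    _accepted: dict = field(default_factory=dict)

    def accept(self, user_id: str, dataset_id: str) -> datetime:
        # Store the acceptance time so it can be audited later.
        accepted_at = datetime.now(timezone.utc)
        self._accepted[(user_id, dataset_id)] = accepted_at
        return accepted_at

    def has_accepted(self, user_id: str, dataset_id: str) -> bool:
        return (user_id, dataset_id) in self._accepted


def get_download_link(ledger: LicenseLedger, user_id: str,
                      dataset_id: str, file_name: str) -> str:
    """Return a link only when the user has accepted the dataset's license."""
    if not ledger.has_accepted(user_id, dataset_id):
        raise PermissionError(
            f"user {user_id!r} has not accepted the license for {dataset_id!r}"
        )
    # Placeholder URL (assumption); a real system would issue a storage link.
    return f"https://example.invalid/datasets/{dataset_id}/{file_name}"
```

In this sketch a request made before `ledger.accept(...)` raises `PermissionError`, and the same request succeeds afterwards, which mirrors the accept-then-download flow described in the text.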
From a technical perspective, the Open Data Repository code is hosted, built, and deployed using Visual Studio Team Services. It uses a combination of tools and technologies and is hosted in the Azure public cloud. These technologies include:
- Cosmos DB: Maintains catalog information about the datasets and their contents.
- Azure Search: To support detailed and text-based searches of dataset properties.
- Azure App Services/Web Sites: To host the web site and administrative tools.
- Azure Batch: To process datasets, including cataloging and creating compressed archives of datasets.
- Azure AD B2C / MSAL.js: To authenticate users via the web application.
- Application Insights: Tracks usage of the web applications and individual datasets.
- SendGrid: To deliver notifications of new dataset nominations.
- Azure Key Vault: To control access to secure Azure resources.
- Azure Blob Storage: Stores the content of the datasets.
- .NET Core: To implement the web application and other related tools.
- ARM templates: To define Azure resources and ensure consistent deployments, as well as to create “big data” resources within a user’s own Azure subscription.
- Angular: To implement the user interface for the web application.
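As an illustration of the dataset processing step listed above (cataloging files and creating a compressed archive), the following is a minimal local sketch using only the Python standard library. It does not use Azure Batch or Blob Storage, and the function names and the catalog fields (`name`, `size_bytes`, `sha256`) are assumptions made for this example rather than the portal's actual schema.

```python
import hashlib
import zipfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files are never fully loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def catalog_and_archive(dataset_dir: str, archive_path: str) -> list:
    """Build a catalog of every file in dataset_dir and zip the files together."""
    root = Path(dataset_dir)
    archive_file = Path(archive_path).resolve()
    entries = []
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(p for p in root.rglob("*") if p.is_file()):
            # Skip the archive itself if it was written inside the dataset folder.
            if path.resolve() == archive_file:
                continue
            relative = path.relative_to(root).as_posix()
            entries.append({
                "name": relative,
                "size_bytes": path.stat().st_size,
                "sha256": sha256_of(path),
            })
            archive.write(path, arcname=relative)
    return entries
```

The returned list is the kind of per-file metadata a catalog store could hold, while the archive gives users a single download for the whole dataset.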
Microsoft needed a partner that was expert at creating enterprise web applications integrated with Azure PaaS and Data/AI services. Wintellect is a Microsoft Gold Cloud Platform, Gold DevOps, and Gold Data Platform partner, as well as an AI Inner Circle partner and Advanced Analytics Training partner. We are recognized as leaders in complex application development and cloud architectures.
The solution was developed and implemented in several phases, each completed within the planned timeframe and budget. It was developed iteratively, with continuous feedback from stakeholders to ensure the system delivered the required functionality. The initial deployment was hosted by Wintellect and later migrated to the Azure cloud using the App Service Migration tool. A later version added monitoring and alerting capabilities using Application Insights. The overall system was also implemented on the assumption that the “big data” technology scene is changing rapidly, with an architecture in place that can adapt flexibly to future offerings within Azure, including advances in Data Lake technologies.