We work with data from project partners in two ways:
1. Some partners prefer that we do our work on their systems. From the partner's perspective, this approach can have significant benefits: the partner retains control of the data, and it is easier to deploy our work at the end of the project.
Partners who choose this approach need to provide us with the computational resources necessary to handle our machine-learning pipeline. For most projects, we can do well with 2-4 cores, 16-32 GB of RAM, and 500 GB of disk space. The more computational resources we get, the faster we can build good models.
We use only free and open-source software, including the following (a sketch of how these pieces fit together appears after the list):
- Linux command-line tools
- Python (numpy, pandas, scipy, and scikit-learn at a minimum)
- Postgres (we can use other database systems, but doing so will slow our work)
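To make the stack concrete, here is a minimal sketch of a typical pipeline step using these tools: pulling a feature table from Postgres into pandas and fitting a scikit-learn model. The connection string, table, and column names are hypothetical placeholders, and using SQLAlchemy for the connection is one reasonable choice, not a requirement.

```python
# Minimal sketch of a pipeline step; all names below are placeholders.
import pandas as pd
from sqlalchemy import create_engine
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Connect to a (hypothetical) Postgres database.
engine = create_engine("postgresql://user:password@localhost:5432/partner_db")

# Pull a (hypothetical) feature table into a pandas DataFrame.
df = pd.read_sql("SELECT * FROM features.training_data", engine)

# Separate features from a binary label (column names are placeholders).
X = df.drop(columns=["entity_id", "label"])
y = df["label"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit a simple baseline model and report AUC on the held-out set.
model = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
model.fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]
print(f"Held-out AUC: {roc_auc_score(y_test, scores):.3f}")
```

More cores and memory translate directly into faster model building here: tree ensembles like the one above parallelize across cores (`n_jobs=-1`), and pandas holds the full feature table in RAM.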
2. Most partners give us an extract or copy of their internal data. We have strict protocols and security procedures for protecting the privacy and confidentiality of the data given to us. Many of our partners have worked with universities before and have standard procedures for extracting and cleaning data for academic use. While those procedures might work well for one-off research projects, they don't work well for the types of projects we take on or for our goal of giving you a working system back. Most DSaPP projects aim to build software that runs on our partners' systems, even after we stop working together. For that to happen, we typically need direct access to the partner's computer system (where we log into the system as any employee would and do our work there) or a database dump.
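If the database-dump route is chosen, the standard Postgres command-line tools are usually the simplest way to produce one. A hedged sketch, assuming a Postgres source database (the database, schema, and file names below are placeholders):

```bash
# Hypothetical example: dump one schema of a Postgres database in the
# compressed "custom" format (all names here are placeholders).
pg_dump --format=custom --schema=public partner_db > partner_db.dump

# On our side, we restore it into a local database with pg_restore
# (the target database must already exist).
pg_restore --dbname=dsapp_copy partner_db.dump
```

A full dump like this preserves the original schema, types, and relationships, which is what lets us hand back software that runs against the partner's own database rather than against a one-off cleaned extract.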