INTRODUCTION
Your data science project has grown and is gaining momentum. But releases are increasingly delayed, the number of bugs keeps rising, and the team cannot keep up with them. Let's figure out what can be done about this.
Such problems usually arise when a project moves from rapid prototyping to preproduction or production, because the organizational and technical basis of the project has to change. In this article, we have tried to highlight and classify the main points that, in our opinion, will help restructure the project, turn the trend from accumulating technical debt to reducing it, and make the technical side of the project predictable and manageable again for management.
ASPECTS RELATED TO THE PROJECT CODE
Data science specialists often care little about code quality or performance. This approach is quite justified at the research stage, when building an MVP, or when the code is written by a single person without a team. However, as the project develops, the team grows, or production code is being prepared, such "creative disorder" can lead to problems.
To be fair, work on the code pays off mainly in the medium and long term: the amount of work is usually large, and results are hard to achieve in a week or two. Nevertheless, if you do not start dealing with it systematically, the project may become unmanageable.
Below we focus on specific areas where the help of an advanced Python developer with data science expertise is especially timely.
1) Bringing data science code to production quality
***As a rule, data science specialists are primarily scientists, not Python developers. They also rarely work in large teams, where more attention is paid to code quality.***
- using meaningful names for variables and functions
- using OOP (for example, inheriting from base classes)
- following the DRY principle
- exception handling
- running Python scripts with parameters from the command line
- type annotations
- documenting functions and commenting code
- checking code against generally accepted standards (PEP-8) using linters and code formatters
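Several of the points above can be sketched in one small script. This is a minimal illustration, not a template from a real project: the task (averaging a CSV column) and the names `column_mean`, `ColumnNotFoundError`, and `main` are hypothetical.

```python
"""Hypothetical example: print the mean of one numeric column of a CSV file."""
import argparse
import csv
import logging
from pathlib import Path
from typing import Optional, Sequence

logger = logging.getLogger(__name__)


class ColumnNotFoundError(KeyError):
    """Raised when the requested column is missing from the CSV header."""


def column_mean(csv_path: Path, column: str) -> float:
    """Return the arithmetic mean of `column` in the CSV file at `csv_path`.

    Raises:
        ColumnNotFoundError: if the header has no such column.
        ValueError: if the column is empty or holds a non-numeric value.
    """
    with csv_path.open(newline="") as handle:
        # restval="" makes short rows fail in float() with a ValueError.
        reader = csv.DictReader(handle, restval="")
        if reader.fieldnames is None or column not in reader.fieldnames:
            raise ColumnNotFoundError(column)
        values = [float(row[column]) for row in reader]
    if not values:
        raise ValueError(f"column {column!r} has no rows")
    return sum(values) / len(values)


def main(argv: Optional[Sequence[str]] = None) -> int:
    """Parse command-line parameters and return a process exit code."""
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("csv_path", type=Path, help="input CSV file")
    parser.add_argument("--column", required=True, help="column to average")
    args = parser.parse_args(argv)
    try:
        print(column_mean(args.csv_path, args.column))
    except (OSError, ColumnNotFoundError, ValueError) as error:
        logger.error("cannot compute mean: %s", error)
        return 1
    return 0
```

In a real script, `main()` would be called as `sys.exit(main())` under the usual `if __name__ == "__main__":` guard, so that the script can be run as, for example, `python column_mean.py data.csv --column price` (the file and column names are placeholders).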
2) Tests
***Tests are a rare guest in data science code. Their role grows in the long term and as systems become more complex.***
- using the capabilities of test frameworks (pytest, unittest)
- using coverage to track the level of code coverage by tests
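As a minimal sketch of what such tests can look like, the block below covers a small preprocessing function with `unittest`. The function `normalize` is a hypothetical stand-in for real project code.

```python
import math
import unittest
from typing import List


def normalize(values: List[float]) -> List[float]:
    """Scale values linearly to the range [0, 1] (hypothetical code under test)."""
    low, high = min(values), max(values)
    if math.isclose(low, high):
        raise ValueError("cannot normalize a constant sequence")
    return [(value - low) / (high - low) for value in values]


class NormalizeTests(unittest.TestCase):
    def test_maps_extremes_to_zero_and_one(self):
        self.assertEqual(normalize([2.0, 4.0, 6.0]), [0.0, 0.5, 1.0])

    def test_rejects_constant_input(self):
        # A constant column has no range, so the function must refuse it.
        with self.assertRaises(ValueError):
            normalize([3.0, 3.0])
```

pytest also collects `unittest.TestCase` classes, so the same file runs under either framework. With the coverage package installed, `coverage run -m pytest` followed by `coverage report` shows which lines the tests exercised.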
3) Performance Optimization
***When transferring code to a server, performance issues that did not occur during local development may arise.***
- code profiling (memory profiling + cpu profiling)
- optimizing Python code (best practices for NumPy, SciPy, pandas)
- optimizing AI/ML/deep learning components (migration to more modern libraries)
- using the Intel Math Kernel Library (MKL)
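Profiling usually comes first, because it shows where the time and memory actually go. The sketch below uses only the standard library (`cProfile`, `pstats`, `tracemalloc`). The workload `build_squares` is a hypothetical placeholder for a real pipeline step.

```python
import cProfile
import io
import pstats
import tracemalloc
from typing import List


def build_squares(n: int) -> List[int]:
    """Hypothetical workload standing in for a real pipeline step."""
    return [i * i for i in range(n)]


def profile_cpu(func, *args) -> str:
    """Run func under cProfile and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(5)
    return buffer.getvalue()


def peak_memory_bytes(func, *args) -> int:
    """Return the peak size of Python allocations traced while func runs."""
    tracemalloc.start()
    try:
        func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak
```

For example, `print(profile_cpu(build_squares, 1_000_000))` prints the five most expensive calls, and `peak_memory_bytes(build_squares, 1_000_000)` returns the peak in bytes. Note that `tracemalloc` only sees memory allocated through Python's allocator, so buffers allocated inside native libraries may not be counted.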
4) GPU Calculations
***GPUs are used to accelerate computations.***
- solving the accompanying memory management problems (especially when using TensorFlow)
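One common source of such problems is that TensorFlow by default reserves nearly all memory on every visible GPU. A minimal sketch of switching to on-demand allocation is shown below. The helper name `enable_gpu_memory_growth` is hypothetical, and the import is guarded so the function also works on machines without TensorFlow.

```python
def enable_gpu_memory_growth() -> int:
    """Ask TensorFlow to allocate GPU memory on demand.

    Returns the number of GPUs configured, or 0 if TensorFlow is not
    installed or no GPU is visible.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return 0
    gpus = tf.config.list_physical_devices("GPU")
    for gpu in gpus:
        # Must be called before the GPU is initialized by any operation;
        # otherwise TensorFlow raises a RuntimeError.
        tf.config.experimental.set_memory_growth(gpu, True)
    return len(gpus)
```

The call belongs at the very start of the program, before any model or tensor is created.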
5) Creating a Python library based on the code
***The same data science code is often needed in several customer projects, so it should be possible to simply import it.***
- creating the packaging files (pyproject.toml, requirements.txt, etc.)
- publishing the library and installing it from a local directory or directly from the code repository
6) Integration with Cloud Services
***Many production solutions rely on cloud services.***
- uploading/downloading files to/from AWS S3 or Google Drive directly from code
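For AWS S3, a minimal sketch could look like the following. The helper names `upload_artifact` and `download_artifact` are hypothetical. The client is passed in as a parameter, on the assumption that it is a boto3 S3 client, which keeps the helpers easy to test with a stub.

```python
from pathlib import Path
from typing import Any


def upload_artifact(s3_client: Any, local_path: Path, bucket: str, key: str) -> str:
    """Upload a local file and return its s3:// URI."""
    if not local_path.is_file():
        raise FileNotFoundError(local_path)
    # boto3 signature: upload_file(Filename, Bucket, Key)
    s3_client.upload_file(str(local_path), bucket, key)
    return f"s3://{bucket}/{key}"


def download_artifact(s3_client: Any, bucket: str, key: str, local_path: Path) -> Path:
    """Download an object to local_path, creating parent directories first."""
    local_path.parent.mkdir(parents=True, exist_ok=True)
    # boto3 signature: download_file(Bucket, Key, Filename)
    s3_client.download_file(bucket, key, str(local_path))
    return local_path
```

In real use the client would be created with `boto3.client("s3")`, with credentials taken from the environment or an instance role rather than hard-coded in the script.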
7) Interaction with the DevOps Team (Git/Jenkins/Docker)
***Integrating Jenkins or SonarQube, creating a Dockerfile, and configuring microservices almost always require a developer's participation.***
- configuring GitHub/GitLab checks
- creating a Dockerfile and Jenkinsfile
- defining a versioning policy
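A versioning policy is easier to enforce when it is written down as code. The sketch below assumes the team adopts semantic versioning (MAJOR.MINOR.PATCH). The change labels "breaking", "feature", and "fix" and the helper `bump_version` are hypothetical.

```python
import re

_VERSION_PATTERN = re.compile(r"(\d+)\.(\d+)\.(\d+)")


def bump_version(version: str, change: str) -> str:
    """Return the next version for a change of the given kind.

    breaking -> MAJOR + 1, feature -> MINOR + 1, fix -> PATCH + 1.
    """
    match = _VERSION_PATTERN.fullmatch(version)
    if match is None:
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {version!r}")
    major, minor, patch = (int(part) for part in match.groups())
    if change == "breaking":
        return f"{major + 1}.0.0"
    if change == "feature":
        return f"{major}.{minor + 1}.0"
    if change == "fix":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change!r}")
```

A CI job could call such a helper to compute the next release tag from the kind of change being merged.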
Example 1
In one project, we had to minimize the time between sending the text of a message and the start of audio streaming for that message (a Text-to-Speech (TTS) process). Optimizing the Python code and parallelizing the process helped, but it was still not enough. The MKL libraries came to the rescue: without any code changes, the TTS processing time dropped by a third, and the performance of individual matrix calculations improved by 60-80%.
ELABORATION/REVISION OF PROJECT INFRASTRUCTURE
Suboptimal project infrastructure can be a serious drawback and a source of risk.
In our opinion, an infrastructure audit should pay attention to the following points:
1) The need to forecast infrastructure load
***Sooner or later, any growing service has to assess its technical capabilities. This applies first of all to the services that are largest and most expensive to scale.***
- conducting load testing
- using a CDN to offload traffic
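The idea behind load testing can be illustrated with a small standard-library harness: it fires a fixed number of requests from several threads and reports throughput and latency. This is only a sketch. The function `run_load_test` is hypothetical, and the `request` callable stands in for a real HTTP call to the service under test.

```python
import math
import statistics
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def run_load_test(
    request: Callable[[], object], total_requests: int, concurrency: int
) -> Dict[str, float]:
    """Call `request` total_requests times from `concurrency` threads.

    Returns throughput plus median and 95th-percentile latency in seconds.
    """
    if total_requests < 1 or concurrency < 1:
        raise ValueError("total_requests and concurrency must be positive")

    def timed_call(_: int) -> float:
        start = time.perf_counter()
        request()
        return time.perf_counter() - start

    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(timed_call, range(total_requests)))
    elapsed = time.perf_counter() - started

    # Nearest-rank percentile: the smallest value covering 95% of samples.
    p95_index = math.ceil(0.95 * len(latencies)) - 1
    return {
        "requests_per_second": total_requests / elapsed,
        "median_latency": statistics.median(latencies),
        "p95_latency": latencies[p95_index],
    }
```

Running the harness at increasing `concurrency` values and watching where the 95th-percentile latency starts to climb gives a first estimate of the service's capacity. Dedicated load testing tools are a better fit for anything beyond a quick check.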
2) If the projected hosting costs exceed $100,000 a year, an in-depth analysis of infrastructure usage patterns is essential
***Limits of various kinds may be annoying, but when money is tight they protect against accidental overconsumption and unplanned expenses.***
- a more modest configuration
- quotas on resources (an upper consumption limit)
- fine-tuning specific services (databases, message brokers, search engines, etc.)
- more complex solutions involving DevOps engineers (for example, scheduling data operations around the load)
3) The right approach to scaling
***Do you need AWS, GCE, or another cloud provider right now? Yes, it's trendy, youthful, and scalable, but it costs 10 times as much as solutions hosted at hetzner.com.***
- assessment of potential traffic (analysis of past activity spikes and forecasting of future ones)
- if your load is around 5000 simultaneous users, it may be worth choosing a cheaper provider (while preparing for a quick move to a scalable solution if the project succeeds)
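A figure such as 5000 simultaneous users can be turned into a rough hardware estimate with Little's law: throughput equals the number of concurrent users divided by the time of one user cycle (think time plus response time). In the sketch below every number except the 5000 users is an assumption made for illustration: 10 s of think time, 0.2 s of response time, 200 requests per second per server, and a 60% target utilization.

```python
import math


def required_rps(
    concurrent_users: int, think_time_s: float, response_time_s: float
) -> float:
    """Little's law: requests per second generated by the concurrent users."""
    return concurrent_users / (think_time_s + response_time_s)


def servers_needed(
    rps: float, rps_per_server: float, target_utilization: float = 0.6
) -> int:
    """Servers required to carry `rps` while staying below the target utilization."""
    return math.ceil(rps / (rps_per_server * target_utilization))


# 5000 / (10 + 0.2) is roughly 490 requests per second.
load = required_rps(5000, think_time_s=10.0, response_time_s=0.2)
# Each server is planned for 200 * 0.6 = 120 requests per second.
servers = servers_needed(load, rps_per_server=200.0)
```

Under these assumptions the load is about 490 requests per second and fits on 5 servers, a scale that a modest hosting setup can usually handle. The per-server figure should come from your own load tests, not from this example.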
4) Infrastructure security audit
***It may sound funny, but passwords like “123456” or “admin” are still in use, which significantly reduces security.***
- backups (frequency and rules)
- passwords and access keys
- server uptime
- physical security of servers in data centers and hosting guarantees in the event of emergencies
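Part of the password check is easy to automate. The sketch below flags the most obvious weaknesses. The deny-list `COMMON_PASSWORDS` and the 12-character minimum are illustrative assumptions, not a complete policy, and a real audit would use a much larger list of leaked passwords.

```python
from typing import List

# Illustrative deny-list only; real audits use far larger lists.
COMMON_PASSWORDS = frozenset({"123456", "admin", "password", "qwerty"})


def password_problems(password: str, min_length: int = 12) -> List[str]:
    """Return a list of human-readable weaknesses; empty if none were found."""
    problems = []
    if password.lower() in COMMON_PASSWORDS:
        problems.append("appears in the common-password list")
    if len(password) < min_length:
        problems.append(f"shorter than {min_length} characters")
    if password.isalpha() or password.isdigit():
        problems.append("uses only letters or only digits")
    return problems
```

For example, `password_problems("admin")` reports all three weaknesses. An empty result only means that none of these simple checks fired, not that the password is strong.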
5) Infrastructure portability audit
***Yes, everyone loves trendy stuff, and programmers and analysts are no exception. But have you thought about what will happen to your project on BigQuery, for example, if you fall under sanctions? Or what would stop cloud providers from raising hosting prices tenfold during an electronics shortage? Such risks are worth considering as well.***
- SberCloud, for example, can be an alternative to Western cloud services
6) Creation of a Monitoring System
***An effective monitoring system allows for the rapid identification of server failures.***
- server and network monitoring with Prometheus and/or Zabbix (with custom metrics if necessary, and with a configuration that takes geo-distributed infrastructure into account)
- code error monitoring (Sentry works well for Python projects)
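Connecting Sentry to a Python service usually takes only a few lines. The sketch below wraps the initialization in a hypothetical helper, `init_error_monitoring`, which degrades gracefully when no DSN is configured or the `sentry_sdk` package is not installed. The DSN itself comes from your Sentry project settings.

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def init_error_monitoring(dsn: Optional[str], environment: str = "production") -> bool:
    """Initialize Sentry error reporting; return True if it was enabled."""
    if not dsn:
        logger.warning("no Sentry DSN configured, error monitoring is disabled")
        return False
    try:
        import sentry_sdk
    except ImportError:
        logger.warning("sentry_sdk is not installed, error monitoring is disabled")
        return False
    # After init, unhandled exceptions are reported automatically.
    sentry_sdk.init(dsn=dsn, environment=environment)
    return True
```

A typical call is `init_error_monitoring(os.environ.get("SENTRY_DSN"))` at service startup, where the environment variable name is an assumption. Exceptions that the code catches itself can still be reported with `sentry_sdk.capture_exception(error)`.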
Example 2
In one project, it was suddenly discovered that supporting the infrastructure consumed a significant share of the allocated funds. To save money, cutting the development team was seriously considered. However, the project had not yet reached its peak load, so the solution was to switch to a cheaper cloud provider (albeit at the expense of scalability). This bought time for the full development team to find a more rational approach to hosting the infrastructure.
ORGANIZATIONAL DECISIONS
A very important point, but we will leave it as a topic for the next article.
HOW CAN WE HELP?
Do you want your data science project to be better? Our team is confident it can help bring order to your project: prepare the code for production and optimize infrastructure costs. Collaboration options include individual specialists who join the project for the long term, a one-time audit, or a whole team of specialists.
Depending on the customer's needs, our approximate course of action may include:
- Signing an NDA
- Preliminary code audit (3 hours of developer time, free) to estimate the further time required
- Code audit (40-80 hours of senior developer time)
- Infrastructure audit (20-40 hours of System Architect + DevOps engineer time)
- Preparation of an action plan for improving the state of the project (10-20 hours)
- Implementation of the plan
You will need to provide access to the project's code and infrastructure, as well as load data.
Have a question? Ask!
If you need help organizing work on a data science project, you can get a free consultation from our CTO.
Is your IT infrastructure ready for a new stage of development?
If you cannot confidently answer "Yes", we suggest putting this question to our specialists and assessing the situation together. Get a consultation!