Data Science Learning Roadmap for 2021

Building your own learning track to master the art of applying data science


Although nothing really changes except for the date, a new year fills everyone with the hope of starting things afresh. Adding a bit of planning, well-envisioned goals and a learning roadmap makes for a great recipe for a year full of growth.

This post intends to strengthen your plan by providing you with a learning framework, resources, and project ideas to build a solid portfolio of work showcasing expertise in data science.

Disclaimer:
This roadmap is based on my own (limited) experience in data science. It is not the be-all and end-all learning plan, and it may change to better suit a specific domain or field of study. It is also built with Python in mind, since Python is the language I personally prefer.

What is a learning roadmap?

In my humble opinion, a learning roadmap is an extension of a curriculum that charts out a multi-level skills map, with details on the skills you want to hone, how you will measure the outcome at each level, and techniques to further master each skill.

My roadmap assigns weights to each level based on its complexity and how commonly it is applied in the real world. I have also added an estimated time for a beginner to complete each level with exercises/projects.

Here is a pyramid that depicts the high-level skills in order of their complexity and application in the industry.


This marks the base of our framework; we'll now deep dive into each of these strata to complete the framework with more specific, measurable details.

Specificity comes from listing the critical topics in each stratum and the resources to refer to in order to master those topics.

We’d be able to measure it by applying the learned topics to a number of real-world projects. I’ve added a few project ideas, portals, and platforms that you can use to measure your proficiency.

Important note: Take it one day at a time, one video/blog/chapter a day. It is a wide spectrum to cover. Don't overwhelm yourself!

Let’s deep dive into each of these strata, starting from the bottom.

1. Programming or Software Engineering

(Estimated time: 2-3 months)

Firstly, make sure you have sound programming skills. Every data science job description will ask for programming expertise in at least one language.

Specific topics include:

  • Common data structures (data types, lists, dictionaries, sets, tuples), writing functions, logic and control flow, searching and sorting algorithms, object-oriented programming, and working with external libraries.
  • SQL scripting: querying databases using joins, aggregations, and subqueries (a small sketch with sqlite3 follows this list).
  • Comfort with using the terminal, version control with Git, and using GitHub
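
A minimal sketch of the SQL bullet using python's built-in sqlite3 module; the table names and rows are invented for illustration, and the single query exercises a join, an aggregation, and a subquery.

import sqlite3

# In-memory database with two toy tables (hypothetical schema for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (1, 1, 120.0), (2, 1, 80.0), (3, 2, 200.0);
""")

# Join + aggregation + subquery: customers whose total spend exceeds the average order amount.
query = """
    SELECT c.name, SUM(o.amount) AS total_spent
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
    HAVING SUM(o.amount) > (SELECT AVG(amount) FROM orders);
"""
for name, total in conn.execute(query):
    print(name, total)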

Resources for python:

  • learnpython.org [free] — a free resource for beginners. It covers all the basic programming topics from scratch, with an interactive shell to practice those topics side-by-side.
  • Kaggle [free] — a free and interactive guide to learning python. It is a short tutorial covering all the important topics for data science.
  • Python Course by freecodecamp on YouTube [free] — a 5-hour course that you can follow to practice the basic concepts.
  • Intermediate python [free] — another free course by Patrick featured on freecodecamp.org.
  • Coursera Python for Everybody Specialization [paid] — a specialization encompassing beginner-level concepts, python data structures, data collection from the web, and using databases with python.

Git

  • Guide for Git and GitHub [free] — complete these tutorials and labs to develop a firm grip on version control. It will help you further when contributing to open-source projects.

SQL

Measure your expertise by solving a lot of problems and building at least 2 projects:

  • Solve a lot of problems here: HackerRank (beginner-friendly), LeetCode (solve easy or medium-level questions)
  • Data extraction from a website/API endpoints — try to write python scripts for extracting data from webpages that allow scraping, like soundcloud.com, and store the extracted data in a CSV file or a SQL database (a minimal sketch follows this list).
  • Games like rock-paper-scissor, spin a yarn, hangman, dice rolling simulator, tic-tac-toe, etc.
  • Simple web apps like youtube video downloader, website blocker, music player, plagiarism checker, etc.
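
As a rough illustration of the data-extraction idea, here is a minimal sketch that pulls JSON from a hypothetical API endpoint (the URL and field names are placeholders, not a real service) and writes it to a CSV file.

import csv
import requests

# Hypothetical public endpoint returning a JSON list of records (placeholder URL).
URL = "https://api.example.com/v1/tracks"

response = requests.get(URL, timeout=10)
response.raise_for_status()
records = response.json()  # assume a list of dicts like {"title": ..., "plays": ...}

# Write the records to a CSV file, one row per record.
with open("tracks.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "plays"])
    writer.writeheader()
    for record in records:
        writer.writerow({"title": record.get("title"), "plays": record.get("plays")})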

Deploy these projects on GitHub pages or simply host the code on GitHub so that you learn to use Git.

2. Data collection and Wrangling(Cleaning)

(Estimated time: 2 months)

A significant part of data science work is centered around finding apt data that can help you solve your problem. You can collect data from different legitimate sources — scraping (if the website allows it), APIs, databases, and publicly available repositories.

Once the data is in hand, an analyst will often find herself cleaning dataframes, working with multi-dimensional arrays, running descriptive/scientific computations, and manipulating dataframes to aggregate data.

Data is rarely clean and formatted for use in the “real world”. Pandas and NumPy are the two libraries that are at your disposal to go from dirty data to ready-to-analyze data.

As you start feeling comfortable writing python programs, feel free to start taking up lessons on using libraries like pandas and numpy.
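
As a flavour of what that wrangling looks like, here is a minimal sketch with pandas and NumPy; the file name and column names are invented for illustration.

import numpy as np
import pandas as pd

# Hypothetical raw file with messy columns (placeholder name and schema).
df = pd.read_csv("raw_sales.csv")

# Fix types and handle missing values.
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
df = df.dropna(subset=["order_date"])
df["amount"] = df["amount"].fillna(df["amount"].median())

# Cap extreme outliers at the 99th percentile using NumPy.
cap = np.percentile(df["amount"], 99)
df["amount"] = np.where(df["amount"] > cap, cap, df["amount"])

# Aggregate: total and average order amount per month.
monthly = df.set_index("order_date").resample("M")["amount"].agg(["sum", "mean"])
print(monthly.head())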

Resources:

Project Ideas:

  • Collect data from a website/API (open for public consumption) of your choice and transform it so that data from different sources is stored in an aggregated file or table (DB). Example APIs include TMDB, Quandl, the Twitter API, etc.
  • Pick any publicly available dataset; define a set of questions that you'd want to pursue after looking at the dataset and the domain. Wrangle the data to find the answers to those questions using pandas and NumPy.

3. EDA, Business acumen and Storytelling

(Estimated time: 2–3 months)

The next stratum to master is data analysis and storytelling. Drawing insights from the data and then communicating the same to the management in simple terms and visualizations is the core responsibility of a Data Analyst.

The storytelling part requires you to be proficient with data visualization along with excellent communication skills.

Specific topics:

  • Exploratory data analysis — defining questions, handling missing values, outliers, formatting, filtering, univariate and multivariate analysis (a small sketch follows this list).
  • Data visualization — plotting data using libraries like matplotlib, seaborn, and plotly, and knowing how to choose the right chart to communicate the findings from the data.
  • Developing dashboards — a good percentage of analysts only use Excel or a specialized tool like Power BI or Tableau to build dashboards that summarise/aggregate data to help management make decisions.
  • Business acumen: Work on asking the right questions to answer, ones that actually target the business metrics. Practice writing clear and concise reports, blogs, and presentations.
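
A minimal EDA sketch with pandas, matplotlib, and seaborn; the dataset name and column names are placeholders, not a specific dataset.

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical dataset (placeholder file and columns).
df = pd.read_csv("listings.csv")

# Quick look at missingness and summary statistics.
print(df.isna().mean().sort_values(ascending=False).head())
print(df["price"].describe())

# Univariate: distribution of price; bivariate: price vs. number of reviews.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.histplot(df["price"], bins=40, ax=axes[0])
sns.scatterplot(data=df, x="number_of_reviews", y="price", alpha=0.3, ax=axes[1])
plt.tight_layout()
plt.show()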

Resources:

Project Ideas

4. Data Engineering

(Estimated time: 4–5 months)

Data engineering underpins the R&D teams by making clean data accessible to research engineers and scientists at big data-driven firms. It is a field in itself and you may decide to skip this part if you want to focus on just the statistical algorithm side of the problems.

Responsibilities of a data engineer comprise building an efficient data architecture, streamlining data processing, and maintaining large-scale data systems.

Engineers use the shell (CLI), SQL, and Python/Scala to create ETL pipelines, automate file system tasks, and optimize database operations for high performance. Another crucial skill is implementing these data architectures, which demands proficiency in cloud service providers like AWS, Google Cloud Platform, Microsoft Azure, etc.
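
For intuition, here is a minimal extract-transform-load sketch in python; the source file, transformation, and SQLite target below are hypothetical placeholders, and real pipelines would add scheduling, logging, and incremental loads.

import sqlite3
import pandas as pd

def extract(path: str) -> pd.DataFrame:
    """Extract: read raw events from a CSV file (placeholder source)."""
    return pd.read_csv(path, parse_dates=["event_time"])

def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Transform: drop bad rows and aggregate events per user per day."""
    df = df.dropna(subset=["user_id"])
    df["event_date"] = df["event_time"].dt.date
    return df.groupby(["user_id", "event_date"]).size().reset_index(name="event_count")

def load(df: pd.DataFrame, db_path: str) -> None:
    """Load: write the aggregated table into a SQLite database (stand-in for a warehouse)."""
    with sqlite3.connect(db_path) as conn:
        df.to_sql("daily_user_events", conn, if_exists="replace", index=False)

if __name__ == "__main__":
    load(transform(extract("raw_events.csv")), "warehouse.db")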

Resources:

Project Ideas/Certifications to prepare for:

  • AWS Certified Machine Learning (300 USD) — a proctored exam offered by AWS. It adds some weight to your profile (it doesn't guarantee anything, though) and requires a decent understanding of AWS services and ML.
  • Professional Data Engineer — a certification offered by GCP. This is also a proctored exam and assesses your ability to design data processing systems, deploy machine learning models in a production environment, and ensure solution quality and automation.

5. Applied statistics and mathematics

(Estimated time: 4–5 months)

Statistical methods are a central part of data science. Almost all the data science interviews predominantly focus on descriptive and inferential statistics.

People often start coding machine learning algorithms without a clear understanding of the underlying statistical and mathematical methods that explain how those algorithms work.

Topics you should focus on:

  • Descriptive statistics — being able to summarise data is powerful, but not always sufficient. Learn about estimates of location (mean, median, mode, weighted and trimmed statistics) and estimates of variability to describe the data (a small sketch follows this list).
  • Inferential statistics — designing hypothesis tests and A/B tests, defining business metrics, and analyzing the collected data and experiment results using confidence intervals, p-values, and alpha levels.
  • Linear algebra, single-variable and multivariate calculus to understand loss functions, gradients, and optimizers in machine learning.
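
A small sketch of those estimates of location and variability with NumPy and SciPy, on made-up numbers.

import numpy as np
from scipy import stats

# Made-up sample: a few incomes with one extreme outlier.
incomes = np.array([32_000, 35_500, 37_000, 41_000, 44_500, 48_000, 250_000])

print("mean:          ", np.mean(incomes))              # pulled up by the outlier
print("median:        ", np.median(incomes))            # robust estimate of location
print("trimmed mean:  ", stats.trim_mean(incomes, 0.2)) # drop 20% from each tail
print("std deviation: ", np.std(incomes, ddof=1))
print("MAD:           ", stats.median_abs_deviation(incomes))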

Resources:

  • [Book] Practical Statistics for Data Scientists (highly recommended) — a thorough guide to all the important statistical methods, along with clean and concise applications/examples.
  • [Book] Naked Statistics — a non-technical but detailed guide to understanding the impact of statistics on our routine events, sports, recommendation systems, and many more instances.
  • Statistical Thinking in Python — a foundation course to help you start thinking statistically. There is a second part to this course as well.
  • Intro to Descriptive Statistics — offered by Udacity. Consists of video lectures explaining widely used measures of location and variability (standard deviation, variance, median absolute deviation).
  • Inferential Statistics, Udacity — video lectures on drawing conclusions from data that might not be immediately obvious. It focuses on developing hypotheses and using common tests such as t-tests, ANOVA, and regression.

Project Ideas:

  • Solve the exercises provided in the courses above and then try to go through a number of public datasets where you can apply these statistical concepts. Ask questions like "Is there sufficient evidence to conclude that the mean age of mothers giving birth in Boston is over 25 years at the 0.05 level of significance?" (a worked sketch follows this list).
  • Try to design and run small experiments with your peers/groups/classes by asking them to interact with an app or answer a question. Once you have a good amount of data after a period of time, run statistical methods on it. This might be very hard to pull off but should be very interesting.
  • Analyze stock prices or cryptocurrencies and design hypotheses around the average return or any other metric. Determine whether you can reject the null hypothesis or fail to do so using critical values.
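
As a worked version of the first idea, here is a hedged sketch of a one-sided, one-sample t-test with SciPy; the ages below are made up purely for illustration, not real birth data.

import numpy as np
from scipy import stats

# Hypothetical sample of mothers' ages at birth (made-up numbers for illustration).
ages = np.array([24, 27, 31, 22, 29, 26, 33, 28, 25, 30, 27, 26])

# H0: mean age <= 25, H1: mean age > 25 (one-sided test at alpha = 0.05).
t_stat, p_value = stats.ttest_1samp(ages, popmean=25, alternative="greater")

alpha = 0.05
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
if p_value < alpha:
    print("Reject H0: evidence that the mean age is over 25.")
else:
    print("Fail to reject H0 at the 0.05 level.")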

6. Machine Learning / AI

(Estimated time: 4–5 months)

After grilling yourself through all the major aforementioned concepts, you should now be ready to get started with the fancy ML algorithms.

There are three major types of learning:

  1. Supervised learning — includes regression and classification problems. Study simple linear regression, multiple regression, polynomial regression, naive Bayes, logistic regression, KNNs, tree models, and ensemble models. Learn about evaluation metrics (a small sketch follows this list).
  2. Unsupervised learning — clustering and dimensionality reduction are the two widely used applications of unsupervised learning. Dive deep into PCA, K-means clustering, hierarchical clustering, and Gaussian mixtures.
  3. Reinforcement learning (can skip*) — helps you build self-rewarding systems. Learn to optimize rewards, use the TF-Agents library, create Deep Q-Networks, etc.
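
To ground the supervised-learning bullet, here is a minimal scikit-learn sketch on a built-in dataset: a train/test split, a logistic regression classifier, and two evaluation metrics. It is a generic illustration, not a prescribed workflow.

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Built-in binary classification dataset.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features, then fit a logistic regression classifier.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Evaluate with accuracy and ROC AUC.
pred = model.predict(X_test)
proba = model.predict_proba(X_test)[:, 1]
print("accuracy:", accuracy_score(y_test, pred))
print("ROC AUC: ", roc_auc_score(y_test, proba))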

The majority of the ML projects need you to master a number of tasks that I’ve explained in this blog.

Resources:

Deep Learning Specialization by deeplearning.ai

Those of you who are interested in diving further into deep learning can start off by completing this specialization offered by deeplearning.ai and the Hands-On book. This is not as important from a data science perspective unless you are planning to solve a computer vision or NLP problem.

Deep learning deserves a dedicated roadmap of its own; I'll create that separately, covering all the fundamental concepts.

Track your learning progress


I’ve also created a learning tracker for you on Notion. You can customize it to your needs and use it to track your progress, have easy access to all the resources and your projects.

Find the video version of this blog below!

Data Science with Harshit

This is just a high-level overview of the wide spectrum of data science and you might want to deep dive into each of these topics and create a low-level concept-based plan for each of the categories.

Feel free to respond to this blog or comment on the video if you want me to add a new topic or rename anything. Also, let me know which category you would like me to do project tutorials on.

You can connect with me on Twitter or LinkedIn.

January 17, 2021: Suwon Real Estate Auction Data Analysis

January 16, 2021: Mobile Game Revenue Rankings

Rank Game Publisher
1 리니지M NCSOFT
2 리니지2M NCSOFT
3 세븐나이츠2 Netmarble
4 Genshin Impact miHoYo Limited
5 기적의 검 4399 KOREA
6 라이즈 오브 킹덤즈 LilithGames
7 R2M Webzen Inc.
8 바람의나라: 연 NEXON Company
9 뮤 아크엔젤 Webzen Inc.
10 V4 NEXON Company
11 S.O.S:스테이트 오브 서바이벌 KingsGroup Holdings
12 블레이드&소울 레볼루션 Netmarble
13 KartRider Rush+ NEXON Company
14 미르4 Wemade Co., Ltd
15 찐삼국 ICEBIRD GAMES
16 리니지2 레볼루션 Netmarble
17 메이플스토리M NEXON Company
18 AFK 아레나 LilithGames
19 PUBG MOBILE KRAFTON, Inc.
20 A3: 스틸얼라이브 Netmarble
21 가디언 테일즈 Kakao Games Corp.
22 Roblox Roblox Corporation
23 명일방주 Yostar Limited.
24 Pmang Poker : Casino Royal NEOWIZ corp
25 블리치: 만해의 길 DAMO NETWORK LIMITED
26 Cookie Run: OvenBreak – Endless Running Platformer Devsisters Corporation
27 라그나로크 오리진 GRAVITY Co., Ltd.
28 FIFA ONLINE 4 M by EA SPORTS™ NEXON Company
29 Lords Mobile: Kingdom Wars IGG.COM
30 한게임 포커 NHN BIGFOOT
31 Homescapes Playrix
32 그랑삼국 YOUZU(SINGAPORE)PTE.LTD.
33 Gardenscapes Playrix
34 Brawl Stars Supercell
35 검은사막 모바일 PEARL ABYSS
36 Epic Seven Smilegate Megaport
37 Age of Z Origins Camel Games Limited
38 FIFA Mobile NEXON Company
39 Top War: Battle Game Topwar Studio
40 컴투스프로야구2021 Com2uS
41 Summoners War Com2uS
42 Empires & Puzzles: Epic Match 3 Small Giant Games
43 Rise of Empires: Ice and Fire Long Tech Network Limited
44 궁3D WISH INTERACTIVE TECHNOLOGY LIMITED
45 Random Dice: PvP Defense 111%
46 Pokémon GO Niantic, Inc.
47 LifeAfter X.D. Global
48 슬램덩크 DeNA HONG KONG LIMITED
49 프린세스 커넥트! Re:Dive Kakao Games Corp.
50 스테리테일 4399 KOREA

Predicting Song Skipping on Spotify

Using LightGBM to predict my song skipping habits based solely on audio features

Introduction

In early 2019, Spotify shared interesting statistics about their platform. Out of the 35+ million songs on the service, Spotify users have created over 2 billion playlists (Oskar Stål, 2019). I thought of the analogy that our music taste is like our DNA: very diverse across 7 billion people, yet the building blocks (nucleotides/songs) are the same. As a result, inferring a user's music taste is challenging, especially since Spotify's business model relies on its ability to recommend new songs.

Problem Statement

Spotify doesn't have a dislike button, so skipped songs are the subtle cues we need to learn from to infer music taste. In this project, I use my 2019 Spotify streaming history to build a predictive model that anticipates whether I would skip a song or not, based solely on its audio features.

You can request your own Spotify streaming history by following these steps.

Data Descriptions

After requesting my Spotify data, I received an email with a ZIP file containing every song I listened to in 2019, the artist's name, and the streaming duration. The data processing is as follows:

  1. I filtered out podcasts and only analyzed songs.
  2. I used the Spotify API to extract the unique IDs of songs and their audio features.
  3. I computed the gap between the duration I streamed the track for and the song's length. If the gap exceeds 60 seconds, I infer that the song was skipped.

Below is a python implementation of the steps.
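
A rough sketch of the skip-labeling logic (step 3), assuming a pandas dataframe built from the exported streaming history with the track's full length already joined in; the column names and toy values below are assumptions for illustration.

import pandas as pd

# Hypothetical streaming-history dataframe: one row per listen, with the track's
# full length (duration_ms) assumed to be joined in from the Spotify API.
history = pd.DataFrame({
    "trackName":   ["Song A", "Song B", "Song C"],
    "msPlayed":    [45_000, 210_000, 30_000],
    "duration_ms": [200_000, 215_000, 180_000],
})

# Gap between the song's length and the streamed duration; a gap over 60 seconds
# is labeled as a skip, as described above.
gap_ms = history["duration_ms"] - history["msPlayed"]
history["skipped"] = (gap_ms > 60_000).astype(int)
print(history[["trackName", "skipped"]])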

Since the aim is to see whether audio characteristics alone can inform us about song skipping, I dropped the columns that contain the song's title and artist.

The final dataset contains the tracks' audio features along with the skipped label.

Assumptions

A crucial step in modeling is to lay out all the assumptions and limitations in order to properly interpret the result. Some assumptions are due to the data collection process and others are part of the modeling process:

  • The user's music taste is homogeneous, i.e., the mechanism that leads a user to skip a song is static across time.
  • Songs are broken down into audio features; hence, the lyrics are not interpreted as natural language text. This limitation is important to consider, since lyrical meaning can be a strong predictor of song skipping.

Modeling

I use LightGBM binary classification to infer my song-skipping habits based solely on audio features.


Bayesian Optimization

LightGBM has many parameters; thus, instead of running through all of their possible values, I used Bayesian optimization for hyperparameter tuning.

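A rough sketch of how such a setup could look with lightgbm's scikit-learn wrapper and the bayes_opt package; the random feature matrix, search bounds, and iteration counts below are placeholders, not the author's actual configuration.

import numpy as np
from bayes_opt import BayesianOptimization
from lightgbm import LGBMClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical stand-ins: X holds the audio features, y the skipped label (0/1).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = rng.integers(0, 2, size=500)

def cv_accuracy(num_leaves, learning_rate, n_estimators):
    """Objective for Bayesian optimization: mean CV accuracy of a LightGBM classifier."""
    model = LGBMClassifier(
        num_leaves=int(num_leaves),
        learning_rate=learning_rate,
        n_estimators=int(n_estimators),
    )
    return cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()

# Search bounds are illustrative only.
optimizer = BayesianOptimization(
    f=cv_accuracy,
    pbounds={"num_leaves": (15, 63), "learning_rate": (0.01, 0.3), "n_estimators": (50, 300)},
    random_state=1,
)
optimizer.maximize(init_points=5, n_iter=25)
print(optimizer.max)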

Results & Discussion

The model performs better with personalized data, reaching an accuracy of 74.17% (28th iteration of Bayesian optimization). The assumption that Spotify users are homogeneous is a strong one, and performance could be improved by gathering more user-level details.

Overall, recommendation engines require both personalized learning about the user and general learning about the songs. In this project, I experimented with machine learning classification using only audio features, audio and user features, and my personal listening history. A further investigation might include the causal relationships between the covariates because perhaps understanding the mechanism by which the data is generated is more informative than curve-fitting.

References

  • Oskar Stål (2019). Music Recommendations at Spotify. Nordic Data Science and Machine Learning Summit. Retrieved from: https://youtu.be/2VvM98flwq0
  • Brian Brost, Rishabh Mehrotra, and Tristan Jehan. 2019. The Music Streaming Sessions Dataset. In Proceedings of the 2019 World Wide Web Conference (WWW ’19), May 13–17, 2019, San Francisco, CA, USA. ACM, New York, NY, USA, 7 pages. https://doi.org/10.1145/3308558.3313641

January 16, 2021: Suwon Real Estate Auction Data Analysis

Facebook’s PyGraph is an Open Source Framework for Capturing Knowledge in Large Graphs

The new framework can learn graph embeddings in large graph structures.


Source: https://morioh.com/p/fdc360a84d73

I recently started a new newsletter focused on AI education. TheSequence is a no-BS (meaning no hype, no news, etc.) AI-focused newsletter that takes 5 minutes to read. The goal is to keep you up to date with machine learning projects, research papers, and concepts. Please give it a try by subscribing below:

Graphs are one of the fundamental data structures in machine learning applications. Specifically, graph-embedding methods are a form of unsupervised learning, in that they learn representations of nodes using the native graph structure. Training data in mainstream scenarios such as social media predictions, Internet of Things (IoT) pattern detection, or drug-sequence modeling is naturally represented using graph structures. Any one of those scenarios can easily produce graphs with billions of interconnected nodes. While the richness and intrinsic navigation capabilities of graph structures make a great playground for machine learning models, their complexity poses massive scalability challenges. Not surprisingly, the support for large-scale graph data structures in modern deep learning frameworks is still quite limited. Recently, Facebook unveiled PyTorch BigGraph, a new framework that makes it much faster and easier to produce graph embeddings for extremely large graphs in PyTorch models.

To some extent, graph structures can be seen as an alternative to labeled training dataset as the connections between the nodes can be used to infer specific relationships. This is the approach followed by unsupervised graph embedding methods which learn a vector representation of each node in a graph by optimizing the objective that the embeddings for pairs of nodes with edges between them are closer together than pairs of nodes without a shared edge. This is similar to how word embeddings like word2vec are trained on text.


Source: https://github.com/facebookresearch/PyTorch-BigGraph?fbclid=IwAR1X2QJ5zltf6-f_OZOB2YmBqqQOM99RehXu_kqCmWA_LyPeBfR4MhSXccU

Most graph embedding methods become quite constrained when applied to large graph structures. To give an example, a model with two billion nodes and 100 embedding parameters per node (expressed as floats) would require 800GB of memory just to store its parameters, so many standard methods exceed the memory capacity of typical commodity servers. This represents a major challenge for deep learning models and is the genesis of Facebook's BigGraph framework.

PyTorch BigGraph

The goal of PyTorch BigGraph(PBG) is to enable graph embedding models to scale to graphs with billions of nodes and trillions of edges. PBG achieves that by enabling four fundamental building blocks:

  • graph partitioning, so that the model does not have to be fully loaded into memory
  • multi-threaded computation on each machine
  • distributed execution across multiple machines (optional), all simultaneously operating on disjoint parts of the graph
  • batched negative sampling, allowing for processing >1 million edges/sec/machine with 100 negatives per edge

PBG addresses some of the shortcomings of traditional graph embedding methods by randomly dividing the graph's nodes into P partitions that are sized so that two partitions can fit in memory. The graph edges are then divided into P² buckets based on their source and destination nodes: for example, if an edge has its source in partition p1 and its destination in partition p2, it is placed into bucket (p1, p2). Once the nodes and edges are partitioned, training can be performed on one bucket at a time; training bucket (p1, p2) only requires the embeddings for partitions p1 and p2 to be stored in memory. The PBG training order guarantees that each bucket has at least one previously-trained embedding partition.
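
A toy sketch of the bucketing idea in plain NumPy; the partition count and edge list are made up, and this only mimics the bookkeeping, not PBG's actual implementation.

import numpy as np

P = 4  # number of node partitions (illustrative)
num_nodes = 1_000
rng = np.random.default_rng(0)

# Randomly assign each node to one of P partitions.
node_partition = rng.integers(0, P, size=num_nodes)

# Toy edge list (source, destination); bucket = (partition of source, partition of destination).
edges = rng.integers(0, num_nodes, size=(10_000, 2))
buckets = {}
for src, dst in edges:
    key = (node_partition[src], node_partition[dst])
    buckets.setdefault(key, []).append((src, dst))

# Training would then iterate bucket by bucket, holding only two partitions in memory at a time.
print(f"{len(buckets)} buckets (out of at most {P * P})")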


Source: https://github.com/facebookresearch/PyTorch-BigGraph?fbclid=IwAR1X2QJ5zltf6-f_OZOB2YmBqqQOM99RehXu_kqCmWA_LyPeBfR4MhSXccU

Another area in which PBG really innovates is the parallelization and distribution of the training mechanics. PBG uses PyTorch parallelization primitives to implement a distributed training model that leverages the block partition structure illustrated previously. In this model, individual machines coordinate to train on disjoint buckets using a lock server which parcels out buckets to the workers in order to minimize communication between the different machines. Each machine can train the model in parallel using different buckets.


Source: https://github.com/facebookresearch/PyTorch-BigGraph?fbclid=IwAR1X2QJ5zltf6-f_OZOB2YmBqqQOM99RehXu_kqCmWA_LyPeBfR4MhSXccU

In the previous figure, the Trainer module in machine 2 requests a bucket from the lock server on machine 1, which locks that bucket’s partitions. The trainer then saves any partitions that it is no longer using and loads new partitions that it needs to and from the sharded partition servers, at which point it can release its old partitions on the lock server. Edges are then loaded from a shared filesystem, and training occurs on multiple threads without inter-thread synchronization. In a separate thread, a small number of shared parameters are continuously synchronized with a sharded parameter server. Model checkpoints are occasionally written to the shared filesystem from the trainers. This model allows a set of P buckets to be parallelized using up to P/2 machines.

One of the indirect innovations of PBG is the use of batched negative sampling techniques. Traditional graph embedding models construct random "false" edges as negative training examples alongside the true positive edges. This significantly speeds up training because only a small percentage of weights must be updated with each new sample. However, the negative samples introduce a processing overhead, since they are produced by "corrupting" true edges with random source or destination nodes. PBG introduces a method that reuses a single batch of N random nodes to produce corrupted negative samples for N training edges. In comparison to other embedding methods, this technique allows training on many negative examples per true edge at little computational cost.

To increase memory efficiency and make better use of computational resources on large graphs, PBG leverages a single batch of Bn sampled source or destination nodes to construct multiple negative examples. In a typical setup, PBG takes a batch of B = 1000 positive edges from the training set and breaks it into chunks of 50 edges. The destination (equivalently, source) embeddings from each chunk are concatenated with 50 embeddings sampled uniformly from the tail entity type. The outer product of the 50 positives with the 200 sampled nodes equates to 9900 negative examples.
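
A rough NumPy sketch of the batched negative-sampling idea; the dimensions are illustrative and only show how one batch of sampled nodes is reused as negatives for a whole chunk of positive edges.

import numpy as np

rng = np.random.default_rng(0)
dim = 8            # embedding dimension (illustrative)
chunk = 50         # positive edges per chunk
num_sampled = 50   # uniformly sampled nodes reused as negatives for the whole chunk

# Embeddings of the chunk's source nodes and of the shared sampled negative nodes.
src = rng.normal(size=(chunk, dim))
neg = rng.normal(size=(num_sampled, dim))

# One matrix multiply scores every positive source against every sampled negative,
# so each sampled node serves as a negative for all 50 edges in the chunk.
neg_scores = src @ neg.T      # shape (50, 50): 2500 negative scores from one shared batch
print(neg_scores.shape)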


Source: https://github.com/facebookresearch/PyTorch-BigGraph?fbclid=IwAR1X2QJ5zltf6-f_OZOB2YmBqqQOM99RehXu_kqCmWA_LyPeBfR4MhSXccU

The batched negative sampling approach has a direct impact on training speed. Without batching, training speed is inversely proportional to the number of negative samples; batched training improves on that relationship, achieving a nearly constant training speed.


Source: https://github.com/facebookresearch/PyTorch-BigGraph?fbclid=IwAR1X2QJ5zltf6-f_OZOB2YmBqqQOM99RehXu_kqCmWA_LyPeBfR4MhSXccU

Facebook evaluated PBG using different graph datasets such as LiveJournal, Twitter data, and YouTube user interaction data. Additionally, PBG was benchmarked using the Freebase knowledge graph, which contains more than 120 million nodes and 2.7 billion edges, as well as a smaller subset of the Freebase graph, known as FB15k, which contains 15,000 nodes and 600,000 edges and is commonly used as a benchmark for multi-relation embedding methods. The FB15k experiments showed PBG performing similarly to state-of-the-art graph embedding models. However, when evaluated against the full Freebase dataset, PBG showed memory consumption improvements of over 88%.


Source: https://github.com/facebookresearch/PyTorch-BigGraph?fbclid=IwAR1X2QJ5zltf6-f_OZOB2YmBqqQOM99RehXu_kqCmWA_LyPeBfR4MhSXccU

PBG is one of the first methods that can scale the training and processing of graph data to structures with billions of nodes and trillions of edges. The first implementation of PBG has been open-sourced on GitHub, and we should expect interesting contributions in the near future.

Kubernetes is deprecating Docker in the upcoming release

Kubernetes and Docker will part ways; what does that mean to you?

Photo by CHUTTERSNAP on Unsplash

This moment was a long time coming: Kubernetes is deprecating Docker as a container runtime after version 1.20, in favor of runtimes that use the Container Runtime Interface (CRI) created for Kubernetes. However, this does not mean Docker's death, and it does not mean that you should also abandon your favorite containerization tool.

As a matter of fact, not a whole lot will be changing for you, as an end-user of Kubernetes; you will still be able to build your containers using docker and the images produced by running docker build will still run in your Kubernetes cluster.

Then why all this fuss? What is changing, and why does Docker suddenly seem like the black sheep? Should we continue writing Dockerfiles?

Learning Rate is my weekly newsletter for those who are curious about the world of AI and MLOps. You’ll hear from me every Friday with updates and thoughts on the latest AI news, research, repos and books. Subscribe here!

Don’t panic

The source of confusion here is that we are talking about two different things. Inside the Kubernetes cluster nodes, a container runtime daemon manages the complete container lifecycle: image pulling and storage, container execution and supervision, network attachments, and many more.

Docker is arguably the most popular choice; however, Docker was not designed to be embedded inside Kubernetes, and that is the root of all the problems. Docker is not just a container runtime; it is an entire tech stack with many UX enhancements that make it easy for us to interact with it. Indeed, Docker contains a high-level container runtime in itself: containerd. And containerd will be a container runtime option for you moving forward.

Moreover, these UX enhancements are not necessary for Kubernetes. If anything, they are obstacles that Kubernetes must work around to get what it really needs. This means that the Kubernetes cluster has to use another tool, called Dockershim, to talk to Docker, which in turn talks to containerd. That adds a level of complexity and another tool that the team must maintain, and another source of potential bugs and problems.

So, what is really happening here is that Kubernetes will remove Dockershim in version 1.23, which will break Docker support.

Should you care?

So, what is changing for you as a developer? Not that much. If you are using Docker in your development process, you will continue to do that, and you will not notice any differences. When you build an image using Docker, the result is not a Docker-specific thing. It’s an OCI (Open Container Initiative) image. Kubernetes and its compliant container runtimes (e.g., containerd or CRI-O) know how to pull and work with those images. This is why we have a standard for what containers should look like in the first place.

On the other hand, if you are using a managed Kubernetes service like GKE or EKS, you will need to make sure that your nodes are running a supported container runtime before Docker support is removed, and to re-apply or update any custom configurations you use. If you are running Kubernetes on-premises, you will also need to make changes to avoid unwanted problems and surprises.

Conclusion

At version 1.20, you will get a deprecation warning for Docker. This change is coming, and like any other, it will likely cause some issues at first. But it isn’t catastrophic, and in the long run, it’s going to make things easier.

I hope this article made some things clearer and relieved some anxieties. At the end of the day, these changes will probably mean nothing to you as a developer.

Learning Rate is my weekly newsletter for those who are curious about the world of AI and MLOps. You’ll hear from me every Friday with updates and thoughts on the latest AI news, research, repos and books. Subscribe here!

About the Author

My name is Dimitris Poulopoulos, and I’m a machine learning engineer working for Arrikto. I have designed and implemented AI and software solutions for major clients such as the European Commission, Eurostat, IMF, the European Central Bank, OECD, and IKEA.

If you are interested in reading more posts about Machine Learning, Deep Learning, Data Science, and DataOps, follow me on Medium, LinkedIn, or @james2pl on Twitter.

Opinions expressed are solely my own and do not express the views or opinions of my employer.

The Reason Behind if __name__ == ‘__main__’ in Python

Why is it necessary?

The statement if __name__ == '__main__':

Photo by author generated from carbon

You might have seen this one before: the syntax which often gets ignored because it doesn’t seem to hinder the execution of your code. It may not seem necessary, but that’s only if you’re working with a single Python file.

Let’s Get Right Into It!

Let's start out by deconstructing the statement from left to right. We already know what an if statement is; however, the most important part of the statement is the two things being compared.

Let's start with __name__. This is used to denote the name of the file that is currently being run, but there is a trick to this: the file currently being run will always have the value __main__.

This sounds confusing at first but let’s clarify.

Let’s create two Python files:

  • current_script.py
  • other_script.py

Please ensure these files are in the same directory/folder.

Inside the other_script.py file, we'll add two print statements, just as shown below.

print("****inside other script.py*****")
print("__name__ is ", __name__)

Run this other_script.py file.

Note: I will be running this file while using Python within the terminal, just as illustrated below. Also note that I am working from a Windows operating system.

python other_script.py

Output:

****inside other script.py*****
__name__ is __main__

Now you realize that it’s just as I stated before. The file being executed will always have the value __main__. This represents the point of entry into our application.

In Python and pretty much every programming language, we can import other files into our application. We’ll now go into our current_script.py and input the following code:

import other_script

print("")
print("****inside current script.py*****")
print("__name__ is ", __name__)

The code above imports other_script.py with the import statement at the top, which is followed by print("****inside current script.py*****") to verify that we are in the current_script.py file.

Be aware that because we imported other_script at the top of the file, the entire contents of other_script.py will be executed at the point where import other_script appears.

Before we continue, take keen note of the output from when we ran other_script.py. Now observe what happens when we execute current_script.py.

python current_script.py

Output:

****inside other script.py*****
__name__ is other_script
****inside current script.py*****
__name__ is __main__

You will now realize that previously when we ran other_script.py, it gave us the value for __name__ as __main__. But now since we ran it as an import in current_script.py, the value of __name__ suddenly changed to the name of the imported script which is other_script.

Furthermore, the value of __name__ for current_script.py is __main__. This goes back to what I had highlighted previously: The file currently being run will always have the value __main__.

Let’s put this all together now.

The file you are currently running will always be __main__, while any other imported files will not be. Those will have the name of their respective files.

Use Cases

This syntax comes in handy when you have programs that have multiple Python files.

Let's create a program that has two classes: a Name class and a Person class. These two classes will be placed in two separate files, name.py and person.py. The Person class uses the Name class in this system.

We’ll start out by building the Name class in the name.py file. This is a simple class that has only two attributes, fname (first name) and lname (last name) along with their corresponding getters and setters.

__repr__ is the default output when the object is printed.
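
A hedged sketch of what name.py could look like based on the description above; the property-based getters and setters are an assumption, and the test under the __main__ guard reproduces the fname=Jordan;lname=Williams output shown below.

# name.py — a minimal sketch, assuming plain properties for the getters/setters.
class Name:
    def __init__(self, fname, lname):
        self.fname = fname
        self.lname = lname

    @property
    def fname(self):
        return self._fname

    @fname.setter
    def fname(self, value):
        self._fname = value

    @property
    def lname(self):
        return self._lname

    @lname.setter
    def lname(self, value):
        self._lname = value

    def __repr__(self):
        # Default output when the object is printed.
        return f"fname={self.fname};lname={self.lname}"


if __name__ == "__main__":
    # Runs only when name.py itself is executed, not when it is imported.
    print(Name("Jordan", "Williams"))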

We added our syntax if __name__ == "__main__":. Based on our understanding, we can tell that the body of this if statement will only be executed if the main file is the one being executed — meaning it is not an import.

But why do we do this?

We do this because it is one of the most important steps when we want certain operations to be done only on the file we are currently running. In this scenario, we wrote a Name class and we are testing out its functionalities.

Output:

fname=Jordan;lname=Williams

As you can see from the output above, we were able to test the functionality of the Name class. However, this concept will not hit home until we’ve built the other class.

Let’s create our person.py file.

Notice the statement from name import Name, where I imported the Name class into our file; it is used on line 7, where self.name = Name(fname, lname).
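
A hedged sketch of what person.py might look like, reconstructed from the test output shown below; the person_id attribute, the constructor signature, and the printing order are assumptions.

# person.py — a minimal sketch based on the output shown below.
from name import Name

class Person:
    def __init__(self, person_id, fname, lname, gender):
        self.person_id = person_id
        self.name = Name(fname, lname)   # the Name class is reused here
        self.gender = gender


if __name__ == "__main__":
    # Quick test of the Person class; runs only when person.py is executed directly.
    person = Person(201107, "John", "Brown", "Male")
    print(person.person_id)
    print(person.name.fname, person.name.lname)
    print(person.gender)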

Output:

201107
John Brown
Male

This is the output from testing our Person class. Notice that there is no output from the Name class because it is encased under the condition __name__ == “__main__” and it is currently not the main file.

Let's now remove the if __name__ == "__main__": check from name.py and see the difference:

Notice that the if __name__ == "__main__": check has now been removed. We will now run our person.py file.

Output:

fname=Jordan;lname=Williams
201107
John Brown
Male

See, here we only wanted to test the functionality of the Person class. However, we are getting outputs from the Name class as well.

This could also be a problem if you had made some Python library and wanted to extend that functionality to another class but do not want that other library to run automatically in your current script.

Other Languages

Some of you who dabble in other programming languages might have noticed that this is the same as the main method or function found in other languages. Programming languages such as Java with public static void main(String[] args), C# with the similar public static void Main(string[] args), and C with int main(void) all have some sort of main function or method that serves as the entry point when executing code spread across multiple files.

Let’s look at the equivalent code in another language.

Let’s look at Java for instance.

Summary

Sometimes you want to run some logic only in the file currently being executed. This comes in handy in testing individual units of code without the hindrance of other files being executed. This can come in handy when building libraries that depend on other libraries. You wouldn’t want a rogue execution of another library in the code you are in.