Tableau’s relationships are pretty cool

Unlike joins, relationships preserve the native granularity of data, reducing the need for LOD expressions.

Last summer, Tableau introduced a new way of combining data called relationships. The old way of combining data using joins is still available, and I imagine that many of us might stick with the familiar joins for a while. However, relationships have much to recommend them, and this post will show some of their ins and outs. Consider the three tables below:

Image by author.

Using joins, Tableau would combine these tables into one flat file like this:

Joining tables into one flat file changes the granularity of some data. Image by author.

Joining tables with different levels of granularity duplicates observations in the more aggregated tables (in this case, the directors and ratings tables). Correctly summarizing measures from these tables requires LOD expressions, which can be challenging.

Relationships preserve the native level of granularity of each table. Users merely define how tables are related and joins are performed as needed.
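To make the difference concrete, here is a minimal Python sketch (no Tableau involved). The director names come from the tables above; the ages and weekly box office figures are made up, with ages chosen to average 60 as in the example discussed below:

```python
# Two tables at different granularity: one row per director,
# and one row per weekly box office entry (illustrative numbers).
directors = {"James Cameron": 66, "Rian Johnson": 54}  # ages average to 60

box_office = [  # (director, weekly gross): 5 Cameron rows, 2 Johnson rows
    ("James Cameron", 10), ("James Cameron", 8), ("James Cameron", 7),
    ("James Cameron", 5), ("James Cameron", 4),
    ("Rian Johnson", 9), ("Rian Johnson", 6),
]

# A join flattens everything: each director's age is repeated once per
# box office row, so a naive average is weighted by row count.
joined_ages = [directors[d] for d, _ in box_office]
naive_avg = sum(joined_ages) / len(joined_ages)   # 66*5 + 54*2 over 7 rows

# A relationship keeps the directors table at its native granularity,
# so the average is taken over directors, not over duplicated rows.
native_avg = sum(directors.values()) / len(directors)

print(round(naive_avg, 1))  # 62.6, inflated toward Cameron
print(native_avg)           # 60.0, one row per director
```

The second number is what relationships give you for free; with a joined flat file you would need an LOD expression to recover it.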

Relationships preserve the native granularity of the data. Image by author.

When we create summaries that involve measures from these tables, each measure is summarized using its native level of granularity.

The average age of our two directors is 60 — never mind that James Cameron has five weekly box office entries and Rian Johnson only two. Since age comes from the directors table whose native granularity is director, the average is calculated across directors. This is natural to anyone new to Tableau, but to those of us used to using LOD expressions to correct for duplication, and used to thinking of our data as one joined flat file, this is rather revolutionary.

Let’s illustrate that idea again in the scatter plot below. We are plotting each director’s age against the sum of box office by movie. Tableau adds up box office revenue for each movie from the box office table, but it takes the director’s age from the directors table, and since each director appears in that table only once, age is not duplicated.

Notice also that with relationships we don’t have to worry about whether to use a right, left, full, or inner join. If a dimension exists in any of the tables used in the view, it is included in the viz, and no nulls are generated for measures of dimension values that are present in one table but not another.

Another change that comes with relationships is the disappearance of the Number of Records field. Now, there is a (count) field associated with each table. This makes sense since each table preserves its level of granularity and thus has its own number of records.

Relationships are not perfect. At the moment, they don’t support calculation “joins”, i.e. it is impossible to define a relationship based on a calculation. This means that any operations required for establishing a relationship, such as splitting a field or splicing two fields, need to be done prior to connecting to the source. Let’s hope that Tableau builds calculations into relationships soon.

It is also worth noting that data source filters extend across related tables. You may think that the tables are kept separate but that is not the case: a filter on values in one table will apply to matching values in related tables. For example, adding a filter to include only directors who are 50 years old, excludes not just James Cameron from the directors table but also all of his movies from the movies table. (It also excludes Jordan Peele’s Us unless we specify to include nulls.)

Tableau does a great job explaining relationships here. While the logic of relationships takes some getting used to, they are definitely worth a try.

Responsible AI at Facebook

To select chapters, visit the YouTube video here.

Editor’s note: This episode is part of our podcast series on emerging problems in data science and machine learning, hosted by Jeremie Harris. Apart from hosting the podcast, Jeremie helps run a data science mentorship startup called SharpestMinds. You can listen to the podcast below:

Listen on Apple, Google, Spotify

Facebook routinely deploys recommendation systems and predictive models that affect the lives of billions of people every day. That kind of reach comes with great responsibility: among other things, the responsibility to develop AI tools that are ethical, fair, and well characterized.

This isn’t an easy task. Human beings have spent thousands of years arguing about what “fairness” and “ethics” mean, and haven’t come close to a consensus. Which is precisely why the responsible AI community has to involve as many disparate perspectives as possible in determining what policies to explore and recommend, a practice that Facebook’s Responsible AI team has itself embraced.

For this episode of the podcast, I’m joined by Joaquin Quiñonero-Candela, the Distinguished Tech Lead for Responsible AI at Facebook. Joaquin has been at the forefront of the AI ethics and fairness movements for years, and has overseen the formation of Facebook’s responsible AI team. As a result, he’s one of relatively few people with hands-on experience making critical AI ethics decisions at scale, and seeing their effects.

Our conversation covered a lot of ground, from philosophical questions about the definition of fairness, to practical challenges that arise when implementing certain ethical AI frameworks. Here are some of my favourite take-homes:

  • Joaquin highlights three different ways of thinking about fairness in the context of AI, and all of them are concerned to some extent with comparing the performance of a given algorithm across different groups of people (genders, races, etc). The first is to consider a system fair if it achieves a minimum level of performance for every group. This would involve setting standards that sound like, “no ethnicity should be misclassified more than 5% of the time.”
    The second approach goes further, requiring equality: the idea that algorithm performance shouldn’t vary too much between groups. Applying this standard would lead to requirements like, “no ethnicity should be misclassified more than 3% more often than any other, on average.”
    The last strategy is to minimize variance in outcomes among groups. The idea here is to apply the prior that “all groups are equally likely to perform behaviour X, so an algorithm that fails to predict that all groups will do X with equal probability must be unfair.”
  • One key question is: who gets to decide which fairness standard applies to a given situation? Joaquin sees this as the core of the fairness issue: we’re not going to resolve all of our profound moral debates anytime soon, so rather than pretend to have moral clarity now, we should focus on creating processes that allow the fairness of specific decisions to be determined on a case-specific basis. For that reason, his team’s mantra is that “Fairness is a process” (and he’s literally got a t-shirt to prove it!)
  • Whatever the fairness process involves, it’s clear that we shouldn’t expect data scientists to handle it all themselves. Data scientists don’t have time to become masters of ethics and philosophy, so it’s necessary to set up structures that abstract away these moral questions before they reach technical teams. To that end, Facebook’s Responsible AI initiative works according to a “hub-and-spokes” model, in which a core team is responsible for setting ethical standards in an open and transparent way, and those standards are then converted by specialized teams into mathematical definitions that can be implemented by data scientists.
  • Joaquin discusses the growing importance of collaboration and coordination among large companies, to ensure that more universal standards of responsible AI development are established, which don’t just reflect the incentives or perspective of a single organization. In that spirit, he now works with the Partnership on AI, an umbrella organization that brings together major players like Facebook, Google, Microsoft and OpenAI to coordinate on questions related to AI ethics best practices, and safe AI development.
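The three fairness standards in the first take-home can be sketched as simple checks over per-group metrics. In this Python sketch the group names, rates, and thresholds are all illustrative, not Facebook’s actual values:

```python
# Hypothetical per-group misclassification rates and predicted-positive
# rates for some classifier; all numbers and thresholds are made up.
error_rate = {"group_a": 0.02, "group_b": 0.04, "group_c": 0.03}
positive_rate = {"group_a": 0.30, "group_b": 0.31, "group_c": 0.29}

# 1) Minimum quality of service: every group's error stays under a floor
# ("no group should be misclassified more than 5% of the time").
min_quality_ok = all(e <= 0.05 for e in error_rate.values())

# 2) Equality of treatment: the worst and best groups stay close
# ("no group misclassified more than 3 points more often than another").
gap = max(error_rate.values()) - min(error_rate.values())
equality_ok = gap <= 0.03

# 3) Outcome parity: predicted-positive rates don't diverge across groups,
# encoding the prior that all groups perform the behaviour equally often.
spread = max(positive_rate.values()) - min(positive_rate.values())
parity_ok = spread <= 0.05

print(min_quality_ok, equality_ok, parity_ok)  # True True True
```

Note how each check is strictly stronger than the previous one in spirit: the first only sets a floor, the second constrains differences, and the third constrains outcomes themselves.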

You can follow Joaquin on Twitter here, or follow me on Twitter here.

Chapters:

  • 0:00 Intro
  • 1:32 Joaquin’s background
  • 8:41 Implementing processes
  • 15:12 No single optimization function
  • 17:51 Definitions of fairness
  • 28:03 How fairness is managed at Facebook and Twitter
  • 32:04 Algorithmic discrimination
  • 37:11 Partnership on AI
  • 44:17 Wrap-up

Please find below the transcript:

Jeremie (00:00):

Hey, everyone. Jeremie here. Welcome back to the podcast. I’m really excited about today’s episode because we’ll be talking about how responsible AI is done at Facebook. Now, Facebook routinely deploys recommender systems and predictive models that affect the lives of literally billions of people every day. And with that kind of reach comes huge responsibility, among other things, the responsibility to develop AI tools that are ethical and fair, as well as well characterized.

Jeremie (00:22):

This really isn’t an easy task. Human beings have spent thousands of years arguing about what fairness and ethics even mean, and we haven’t come to anything close to consensus on these issues. And that is exactly why the Responsible AI community has to involve as many disparate perspectives as possible when they’re determining what policies to explore and recommend. And that’s a practice that Facebook’s responsible AI team has applied itself.

Jeremie (00:44):

Now, for this episode of the podcast, I’m joined by Joaquin Quiñonero-Candela, the distinguished tech lead for Responsible AI at Facebook. Joaquin’s been at the forefront of AI ethics and the AI Fairness Movement for years. And he’s overseen the formation of Facebook’s entire Responsible AI team basically from scratch. So, he’s one of relatively few people with hands-on experience making critical AI ethics decisions at scale and seeing their effects play out.

Jeremie (01:09):

Now, our conversation is going to cover a whole lot of ground from philosophical questions about the definition of fairness itself, to practical challenges that arise when implementing certain AI ethics frameworks. This is a great one. I’m so excited to be able to share this episode with you, and I hope you enjoy it as well. Joaquin, well, thanks so much for joining me for the podcast.

Joaquin (01:26):

Thank you, Jeremie. Thank you for having me. It’s a great pleasure and I absolutely love your podcast.

Jeremie (01:32):

Well, thanks so much. It’s a thrill to have you here. I’m really excited for this conversation. There’s so much that we can discuss, but I think there’s a situational piece that’ll be useful to tee things up. And that’s a short conversation about your background, how you got into this space, how somebody just came in from outside of ML and academic journey and then eventually came to the position that you’re in, which is leading up the Responsible AI initiative at Facebook. How did you get there?

Joaquin (01:56):

We have to have a deal where you can interrupt me whenever you want because once I get talking about my journey, I can go long. I think at a very, very high level, I finished my PhD in machine learning in 2004. At the time, very few people used the word AI in the community that I was in. The NeurIPS community was mostly an ML community. I had a little stint of being an academic. I was a postdoc at the Max Planck Institute in Germany. My first transition was I joined Microsoft Research in Cambridge in the UK in January 2007 as a research scientist.

Joaquin (02:43):

And so what happened there was a very fundamental thing in my career, which is that I came across product teams that were using machine learning. In particular, we started to talk to the team that would become the Bing organization, so Microsoft’s search engine, before it launched. And we started to apply ML to increase the relevance of ads for people. And specifically, we were building models that would predict whether someone would click on an ad if they were to see that ad. That work really got me hooked to the idea that ML was not just something that you did in scientific papers, but it was something that you could actually put in production.

Joaquin (03:24):

So, in 2009, the Bing organization asked me, “Hey, why don’t you leave the comfort of Microsoft Research, that comfortable sort of research environment, and why don’t you become an engineering manager and lead a team within Bing?” So, I did just that. It was probably one of the most stressful and dramatic transitions in my life because I knew nothing about being an engineering leader. I wasn’t used to doing roadmapping, headcount planning, budgeting, being on call, being responsible for key production components. So, it was pretty stressful. But at the same time, it was wonderful because it gave me the view of the other side.

Joaquin (04:04):

And one theme that has been recurring in my whole career and still is now is this idea of technology transfer. The funnel that goes from a whiteboard and some math, some idea, some research, some experiment, all the way to a production system that needs to be maintained and running.

Joaquin (04:22):

And so this transition gave me a first person view into that other world on the other side of research, that other world where sometimes as a researcher, you’d go in and you’d say, “Hey, you all should consider using this cool model I just built,” and then people would go like, “I don’t have time for this. I’m busy.” So, now all of a sudden, I was on the other side. It was a great experience, ended up also helping to build the auction optimization team as well. So, that got me to learn a little bit about mechanism design and economics and auctions.

Joaquin (04:57):

And then I came to Facebook in May 2012. I joined the Ads organization. I started to build out the machine learning team for Ads. And then I think the second big moment in my career happened, which is that, I think the phrase I would tell everyone is it just hit me really hard that wizards don’t scale.

Joaquin (05:16):

So, the challenge we had is we had a few brilliant machine learning researchers at Facebook, but the number of problems we needed to solve just kept increasing. And we were almost gated by our own ability to move fast. People would be building models in their own box, having their own process for shipping them to production, and I felt the whole thing was slow. And the obsession became, how can we accelerate research to production? How can we build the factory?

Joaquin (05:51):

And that was an interesting time because all of a sudden, the difficult decision there was to say, well, rather than rushing to build the most complicated neural network for ranking and recommendations possible, let’s just stick to the technology we have, which is reasonable. We were using boosted decision trees and we were using online linear models and things like that, which are relatively simple. At large-scale, that’s maybe not that simple. But the idea was, can we ship every week? Can we go from taking many weeks from-

Jeremie (06:26):

Wow.

Joaquin (06:26):

Yeah, that was the vision. The vision was ship every week. It became almost like the mantra. Everyone was like, “Ship every week.” And so that led us to build the entire ecosystem for managing the path from research to production. And it did many things. It’s a set of tools that we built across the company. It’s interesting because it was focused on ads only. The idea was that it would be very easy for anyone in ads building any kind of model to predict any kind of event on any kind of surface to actually share code, share ideas, share experimental results.

Joaquin (07:00):

Almost also actually to know who to talk to. Actually, a big thing is to build a community. The vision was for this to be agnostic to the framework you actually express your models in. When TensorFlow and Keras became popular, we supported that of course. We support PyTorch. But if you really want to write stuff from scratch in C++, be our guest. It’s pretty agnostic. It’s all about managing workflows.

Joaquin (07:28):

And what ended up happening is that teams started to ask to adopt what we were building. And fast forward in time, what happened is that the entire company adopted what we built. We started to add specialized services on top for computer vision, for NLP, for speech, for recommendations, machine translation, basically everything. And that led to the creation of a team called Applied ML, which I helped build and led for a few years, which essentially just provided for the entire company and democratized ML and put it into the hands of all product teams.

Joaquin (08:06):

The idea was that since wizards don’t scale, you don’t need that many wizards, that the wizards will be the creators of fundamental innovation, but then you can have an ecosystem of hundreds, if not thousands of engineers across a company who can very easily leverage and build on those things. So, I did that until three years ago essentially. And I’ll take a breath here in case you want to ask me any question about this part before we transition into, I guess, what would become then the third big moment, which is the transition to Responsible AI.

Jeremie (08:41):

One of the big take-homes for me in this, especially with the second phase of your career, is you’re talking about implementing processes, like getting really, really good at implementing processes. And it seems to me, anytime people talk about AI ethics, AI fairness, that sort of thing, that this is the thing that gives me chills or makes me a little bit nervous about the whole space, is the apparent lack of consistency.

Jeremie (09:04):

It’s hard to know what AI fairness is, what AI ethics is. It seems like everyone has a different definition. And it’s also unclear to me how, even once we have those definitions, we can then codify them into processes that are replicable and scalable. So, that expertise must’ve been critical, right? I mean, that process development flow, that’s going to map onto the fairness in AI work?

Joaquin (09:27):

It does. Now, you hit the nail on the head. It’s interesting because a lot of the things we did when we were democratizing AI inside the company are being very helpful now that we’re working on Responsible AI, although maybe that was not the first idea we had in mind. But the idea that right now we have the tooling that allows us to see what are all of the models that are deployed in production across Facebook and who’s built them and what went into them turns out to give you this level of abstraction and consistency, which is really important. But of course, there’s a lot more to Responsible AI. Like you said, there is broad disagreement on definitions of concepts. So, maybe let me tell you about my own journey into that world.

Joaquin (10:23):

I had been following with the corner of my eye the Fairness, Accountability, and Transparency Workshop. I believe the first year it took place was probably 2014 or something like that at NeurIPS, and then it kept recurring. And then I believe 2018 might be the year where it spun out. And in fact, I attended that workshop. It became its own sort of separate conference and it took place, 2018 it was in New York.

Joaquin (10:53):

But just before that, end of 2017 at the New York conference, there were a couple of fundamental keynotes by people like Kate Crawford or Solon Barocas and many others. And it just became obvious to me, like my whole brain exploded in a way and I thought, the time is now. It’s obvious the time is now. We need to put the foot on the gas here. We need to go all in. We need to go from initial efforts to really, really build a big dedicated effort that focuses on Responsible AI.

Joaquin (11:27):

And then with my background as someone who loves math and as someone who loves engineering and as someone who loves processes and platforms and giving people tools, I sort of, okay, well, I think we can figure algorithmic bias out in a couple of months [crosstalk 00:11:42].

Jeremie (11:43):

It can’t be that hard.

Joaquin (11:44):

Yeah, it can’t be that hard. How hard can it be? I thought, let’s take a look here. So, I looked at some of the work, some of the definitions. And then of course, immediately it became clear. This is work by Arvind Narayanan, who gives this beautiful talk, 21 definitions of fairness and their politics. There’s a talk he gave about that in 2018. At the time it was called the FAT* conference. Now it’s called FAccT. [inaudible 00:12:11] community were great at coming up with terrible names like [NIPS 00:12:14] and FAT. And then luckily we renamed them.

Jeremie (12:18):

Move fast and break things.

Joaquin (12:19):

Move fast and break things. So, I thought, okay, well, that’s fine. Out of these 21 definitions, some of the definitions of fairness try to equalize outcomes between groups, some of the definitions of fairness try to focus on treating everyone the same, and I’m like, okay, how many do we really need? And it’s like, maybe we can implement a couple. I imagine like a drop down box like, okay, which one do I pick as a [crosstalk 00:12:44] practitioner? I thought we can have beautiful visualizations of data composition broken down by subgroup, accuracy of your model, calibration curves, all these things. I’m like, okay, we got this, we can get this done in a couple of months.

Joaquin (12:59):

Then it hit me like a ton of bricks that AI fairness is not primarily an AI problem, and that math does not have the answer. And that it’s extremely context dependent what definition of fairness you should use. That you need to build multi-disciplinary teams that involve people who come from moral and political philosophy. That’s actually extremely important. And probably the most important one of all is that fairness is not a property of a model. It’s not a status box. It’s actually a process. Fairness is a process. In fact, I’m going to tilt the camera a little bit and show you … I intentionally am wearing this T-shirt we made for the team that says, fairness is a process.

Jeremie (13:45):

Ah, very nice.

Joaquin (13:46):

So, this is a set of T-shirts we built for the team because I kept repeating this all the time. And so our executive assistant for the team one day came up with this pile of T-shirts for everybody and says, “Okay, you all keep saying all the time fairness is a process [inaudible 00:14:03].” So anyway, we wear them with pride. You’re never done. The thing that maybe is a little bit hard for an engineer to think about is that this is not like, oh, I’m going to build this microphone here and it’s done. I’ve tested it, it works. It’s not like that.

Joaquin (14:20):

And if you go back in time, I guess humanity has been discussing fairness since Aristotle. And we’re still discussing, and we don’t agree. And it becomes political as it should because we have different ideologies. So, you need to build these processes that are multidisciplinary, that involve risk assessments, surfacing the decisions, documenting how you make them, and all of that. And that’s only for fairness. Obviously, in Responsible AI you have many other dimensions to tackle as well.

Jeremie (14:54):

I love the idea of fairness as a process partly because, well, there’s this well-known principle called Goodhart’s law. As soon as you define a metric, you say, we’re going to optimize this one number, then all of a sudden people find ways around it, they find hacks, they find cheats. I’m arguably seeing that with the stock market today.

Jeremie (15:12):

Time was, back in 1950, as a general rule, if the stock market went up, people’s quality of life went up. But now we’re seeing a decoupling as people start to play games and politics and optimize around it. So, this seems like … I mean, is this part of the strategy, recognizing the fact that there’s no single loss function, no single optimization function to improve, and instead saying let’s focus on the process?

Joaquin (15:35):

Absolutely. I have several things I’d like to say about this. First of all, every single metric is imperfect. Even if you’re not working on fairness specifically, the moment you set a metric goal, and in particular, I think one of the reasons that many technology companies are very successful is that I think we can iterate fast, we can be very quantitative, we can be metrics-driven, and we can move those metrics.

Joaquin (16:11):

But that at the same time is also potentially our Achilles heel; that is what can actually get you into trouble. So, it’s essential to develop very strong critical thinking. And it’s essential to develop what we call counter metrics, like other metrics that work in opposition, which you should not make worse as you make this one metric better. But even that is-

Jeremie (16:36):

[crosstalk 00:16:36] an example of one of those actually? That sounds like a fascinating idea.

Joaquin (16:39):

A simple idea would be, imagine that you are trying to improve the accuracy of a speech recognition system and you’re measuring it by word error rate or something like that, whatever you want. So, one counter metric you could build if you cared about fairness is you could actually break down, you could disaggregate your metric, say by, in the U.S. you could do this by some geographical grouping since accents vary across the country. And then your counter metric could be that you cannot decrease performance for any group because averages will always get at you, right?
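Joaquin’s speech recognition example can be sketched in a few lines of Python; the accent groups and word error rates here are hypothetical:

```python
# Hypothetical word error rates by regional accent group, before and
# after a model change; groups and numbers are illustrative.
wer_before = {"northeast": 0.08, "south": 0.10, "midwest": 0.09}
wer_after  = {"northeast": 0.05, "south": 0.11, "midwest": 0.07}

def average(wer):
    return sum(wer.values()) / len(wer)

# The headline metric improves on average...
improved = average(wer_after) < average(wer_before)

# ...but the counter metric is disaggregated: no group may get worse.
counter_ok = all(wer_after[g] <= wer_before[g] for g in wer_before)

print(improved, counter_ok)  # True False: "south" regressed
```

This is exactly the failure mode he describes: a big win for the majority groups hides a regression for one group unless the counter metric is checked per group.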

Jeremie (17:21):

Yeah.

Joaquin (17:21):

You can make things twice better for the majority group, you can make them worse for another group, and you might not even realize. So, that’d be a very fairness centric metric. But you could imagine many other scenarios where you’re trying to optimize many aspects of an experience, and there are certain things that you don’t want to make worse.

Jeremie (17:51):

I think one of the first comments you made about this was fairness or definitions of fairness are not a challenge for the data scientists.

Joaquin (17:58):

Right.

Jeremie (17:58):

That’s not something they should have to worry about, which makes total sense. Data scientists don’t have time to get philosophy degrees and degrees in public policy. So, it’s going to be someone’s job ultimately. And whether that’s a policy maker or somebody with more of a liberal arts background, one question that comes to mind is the character of conversations around these metrics gets really deeply technical really fast. And I imagine just being able to communicate the nature of the decisions being made, that’s its own challenge. Do you see that coming up a lot, and are there any strategies that you use to address that challenge?

Joaquin (18:32):

It’s a challenge. One of the things we’ve been having success with is to define a consistent vocabulary for talking about fairness across the company and also a consistent set of questions that are increasing in their complexity to make this very concrete and illustrated with examples. One of the most fundamental definitions of fairness you can have is to ask the question, does my product work well for all groups? The question of defining groups is very context dependent as well of course.

Joaquin (19:18):

To my left I have our Facebook Portal camera, and it has a smart AI that is capable of tracking people. If I’m video conferencing with my parents in Spain or my parents in [inaudible 00:19:36], Germany, which is basically now you can imagine how our weekends go with our kids or we’re videoconferencing with them, we’ve actually given them some of those devices too because if you have your phone that you’re using to Skype or to FaceTime or to use whatever software you use, it’s really annoying to try to keep the frame centered and so on.

Joaquin (19:56):

And so with Portal it’s magic, but the AI that tracks people and that sees if you have two people to zoom it out, crop, and center and everything, you can’t take for granted that it’s going to work equally well across skin tones or genders or ages. In that context, you can say, well, how do I define a minimum quality of service? And a minimum quality of service is going to be that the failure rate where someone gets cropped out or something has to be smaller than a certain small percent of the time or of the number of frames or sessions or something.

Joaquin (20:34):

And that’s a reasonable concept, this idea of minimum quality of service. I want my product to work well enough for everybody. Of course, within the context of my product, one of the key questions is, hey, given what you’re trying to build, who are the most sensitive groups you should consider? Later on, if you want, I’ll put something in the topic jar, which is we can talk about election interference. And we can talk about how AI can help in that context. And we can talk about the India elections. And we can then reason about what groups are relevant in that context.

Joaquin (21:12):

But back to the Portal example, the most basic set of questions about fairness would be, hey, can you disaggregate your metrics? Teams typically have launch criteria. So the data science, the ML team will say, “Hey, is this model good enough?” Well, the question is, instead of having a roll-up metric, disaggregate it by group, in this case, it could be skin tone, there’s many scales you can use for this, age and gender, and then for each of those buckets, look at the bucket for whom the performance is worst and imagine that that was the only population you had, would you still launch?

Joaquin (21:50):

If the answer is no, then maybe you need to sit back and figure out, okay, how do I make sure that I’m not leaving anybody behind. Then leading to the India elections, there is one set of questions and one definition of fairness that is more comparative in nature. In a way, minimum quality of service compares against the bar, but it doesn’t necessarily then worry about, oh, it may work still a bit better for a group than another. It’s fine as long as it works well enough for everybody. And I’ll put another topic in the topic jar, which is Goodhart’s law and how even you can game or screw up your minimum quality of service as well.

Joaquin (22:30):

But the next level up would be equality or equality of treatment. You could say, okay, well, I actually care about differences in performance. And the India elections example goes like this. Last year around April, May, if I remember correctly, we had the biggest elections in the history of humankind. I think it was close to a billion registered voters or something along those lines. A little bit less than that, but massive. And in a time where there are many concerns of misinformation and attempts of manipulation.

Joaquin (23:08):

The way we would tackle this without AI would be to have humans who are given standards, community guidelines, content policies, and who check that content doesn’t violate those. And if it does, you take them down. But the problem is at the scale of this election and at the volume we have, there’s no number of humans you can hire to do this. So then what you need to do is you need to prioritize the work because obviously, my post about me playing the guitar or featuring my cat or my dog, we shouldn’t waste time reviewing those.

Joaquin (23:47):

But if there are issues that politicians are discussing, or even if it’s not politicians, organizations are talking about civic content, content that discusses social or political issues, then we should prioritize those. So, we built an AI that does just that. We call it the civic classifier. So, what it does is it sifts through content and it identifies content that is likely to be civic in nature, and then prioritizes it for review by humans.

Joaquin (24:20):

What’s the fairness problem? Well, the fairness problem is that in India, you have, I think it’s over 20 official languages, and there are many regions that have very different cultural characteristics. So really, if you break down your population into regions and languages, you can see how those actually correlate with things like religion or caste. And the social issues that concern people in different regions are different. But now you have an AI that’s prioritizing where we put those human resources to review content. What happens if that AI only works well for a couple of languages and it doesn’t work for others?

Joaquin (25:01):

So, if you take the language analogy, what happens if it overestimates risk for a language? Meaning we would prioritize … we’d put more human resources to review posts in that language. And if it underpredicts or underestimates the likelihood that something is civic in nature for another language, then we’re not allocating enough human resources there.

Joaquin (25:22):

And then we decided not to stop at minimum quality of service, we thought, this is a case where we really want equal treatment. And since what we have here is a binary classifier that outputs a number between zero and one that says, what’s the probability that this piece of content is civic, then what we said is, well, we want those predictions to be well calibrated for every single language and every single region.

Joaquin (25:47):

So again, the notion of fairness there, we set a higher bar because this was about fairness of allocation of resources. And then just to complete the picture, the third common definition and associated set of questions that we’re also using internally is around equity, and is around asking not just whether we should treat everybody and everything the same, or whether the performance of the system should be above a certain bar for everybody. The question then becomes, is there any group of people, for example, for whom the product outcomes deserve special attention? And this can be because there might be some historical context that we should take into account.

Joaquin (26:42):

And one example we’ve seen in 2020 in the wake of racial justice awareness in the U.S. has been the focus, for example at Facebook, on business outcomes for Black-owned businesses. So, we’ve built product efforts that help give visibility to Black-owned businesses because we felt that the Facebook platform has the opportunity to help improve a situation that was already there.

Joaquin (27:17):

Many times the question when you think about fairness too is one of responsibility. It’s like, okay, people may say, well, but society is biased, and here I am, I’ve built a neutral piece of technology, I don’t have skin in this game. And I think what we’re saying is it doesn’t work that way. Your piece of technology can reflect these biases, it can perpetuate them, it can encapsulate them inside of a black box that legitimizes them or that makes them very difficult to debug or understand. And then on the flip side, technology can also make outcomes better for everybody. But sometimes we have to ask ourselves, is there any group that we’re going to prioritize? That was a lot of words.

Jeremie (28:03):

No. I think it’s hard to answer the question, how is fairness managed at Facebook in a Twitter [inaudible 00:28:10].

Joaquin (28:09):

Right.

Jeremie (28:10):

But I think it’s really interesting the different kind of lenses and different approaches to fairness you’re describing. From a process standpoint, one of the things I’d be really curious about is, where within the organization does the decision with respect to which standard of fairness to apply come from? And do you think that the process for making that decision is currently formalized and structured well enough? Obviously, there’s going to be iteration, but do you feel like you have a good sense for how that process should be structured?

Joaquin (28:42):

Yeah. So, we’re building the process. But what was very clear to us is that it’s a hub-and-spokes model. So, we have a centralized Responsible AI team that is multidisciplinary. It has, as I mentioned earlier, not only AI scientists and engineers, but it also has moral and political philosophers and social scientists. And what that team does is the whole layer cake that ranges from providing an ethical framework for asking some of these questions, so that would be the qualitative part, if you will, then going more into the quantitative part, which is like, hey, how do we …

Joaquin (29:26):

Now you talked about equality versus minimum quality of service. And for equality, you talked about binary classifiers. What does that actually mean in math? What do I look at? Do I look at false positive rates between groups or something like that? Answer: no, we don’t necessarily, because that has problems. But what do you do?

Joaquin (29:48):

And then there’s another layer beneath this, which is more the platform and tooling integration. Again, leveraging all of the work we had done in Applied ML, how do we make it really easy and really friendly to be able to break down the predictions of the algorithm by groups, et cetera? So, you have this entire layer cake including things like office hours, which are actually really important. It is centralized.

Joaquin (30:15):

And then what you see is that every product team is building their own embedded expert group that will collaborate very closely with a centralized team. And the reason you have to do it this way is that you need the centralized team to have consistent practices, consistent terminology and definitions, but the decisions, to your point earlier, need to be made in context.

Joaquin (30:45):

And so it would be impractical and inefficient to have the centralized team make hundreds of decisions that are very context-dependent. And then the biggest thing is because we said, it’s a process, you don’t make the decision once and you’re done. You keep at it. Right?

Jeremie (31:04):

Yeah.

Joaquin (31:06):

Just as a side note, I watched the movie, Coded Bias, by Shalini Kantayya. It features Joy Buolamwini and Timnit Gebru and Cathy O’Neil and many other brilliant researchers on Responsible AI. And I think it must be really dangerous to try to cite people live and then get the person wrong.

Joaquin (31:32):

One of them, I’m going to keep it at that level of abstraction because I don’t remember who said it, one of them said something like, “Many of these Responsible AI practices, like fairness, they’re like hygiene. It’s like you have to brush your teeth every day.” And so that’s how I think about it. And you have to do it in context of your product. So, we’re seeing an increasing number of product teams build their own embedded effort, and we collaborate very, very closely with them.

Jeremie (32:04):

And how do you decide … So, there’s an almost economic aspect of this that I’m fascinated by. I think I saw Amanda Askell, one of OpenAI’s AI policy specialists, she tweeted about this idea that there’s a motif right now in the AI ethics community that if I can find a reason that an algorithm is bad, then that algorithm simply should not be deployed. There’s that vibe in the zeitgeists to some extent where people will say, “Oh, well, this algorithm discriminates in this marginal way.”

Jeremie (32:33):

And of course, we know all algorithms will inevitably discriminate in some way. So, there’s got to be some kind of threshold of some sort where we say, okay, this is too much, and this is not enough. How do you think about, I guess the economic trade-off from that standpoint? I mean, you could try to debias algorithms until the cows come home, but at some point, something’s got to get launched. So yeah, I guess, how do you think about that trade-off?

Joaquin (32:59):

I think it’s extremely difficult to come up with a purely quantitative answer to that question. I mean, again, as a math and engineering person, I’d love to be able to write a cost function where I say, if I’m reducing hate speech using AI, the cost of a false positive is $40 and three cents. And that’s very hard. What that touches on, I think, is a few things. It touches on transparency, and it touches on governance.

Joaquin (33:40):

And where I believe things are going to go is that the expectation is going to be that when you build AI into a product, you are very transparent about the risk-benefit analysis. There’s a phrase that I’m using these days, and I’m probably stealing it from someone, I don’t know: this idea of AI minimization, in the same way as people talk about data minimization. Instead of my instinct being let’s use AI because it’s cool, it should be, can I do it without AI? And then if the answer is maybe, but if I put AI in, it adds a ton more value for everybody, it’s like, okay, well, that’s justified, I have a value.

Joaquin (34:33):

And then comes the flip side, which is all the list of risks. And one of the risks may be discrimination, or it may be other definitions of unfairness. And I think being transparent about those trade-offs and having ideally like a closed loop mechanism to get feedback and to adjust and see what’s acceptable, I think this is an inevitable future that we will be in.

Joaquin (35:06):

An example that I use a lot is not an AI example, but again, I’ll say it again, many of these problems are not familiar AI problems. We think about content moderation and the difficult trade-off between giving everybody a voice and allowing for freedom of speech, but on the other hand, protecting the community from harmful content. And there is misinformation that is obviously harmful. We’ve come to the firm conclusion that we neither can nor should do this ourselves or on our own. And when I say we, I’m of course speaking as a Facebook employee. But I think this is generally true across technology.

Joaquin (35:55):

We’re experimenting with this external oversight board. Maybe you’ve heard of it. It’s a group of experts. It’s taken the team that put it together a lot of work to come up with the bylaws and how they’re going to operate. But let’s focus on the spirit of it for a second. The spirit of it is that there will be decisions that are tricky. No content policy is ever going to be perfect. And so it’s essential to have humans in the loop, to have governance that includes external scrutiny and expertise. I think at the end of the day, it’s going to be this interplay of transparency, of accountability, of governance mechanisms that are participatory. Right?

Jeremie (36:45):

Yeah.

Joaquin (36:46):

If you zoom out and you think about AI in 10 years being very prevalent, being very consequential, one big question is going to be one of participatory governance, right?

Jeremie (36:58):

Yeah.

Joaquin (36:59):

How do you or I have a say in where the AI that drives our relatives or our pets to the vet, I mean, that would be pretty phenomenal, like some sort of self driving-

Jeremie (37:11):

And that itself, it’s almost as if it imposes this sort of like energy tax on the whole process where Facebook … I think it’s great that Facebook is doing this. I think it’s great that Facebook has a responsible AI division. And to some degree, it makes perfect sense that it can only happen at scale today because it’s only at scale that companies like Facebook and Google and OpenAI can actually afford to build giant language models and giant computer vision models. And this leads … I wish we had more time for this, but I do have to ask a question about Partnership on AI.

Joaquin (37:43):

Of course, yeah.

Jeremie (37:43):

Because it’s such an important part of this equation. One of my personal concerns is actually, we’ve talked a little bit about this here, but just the idea that there really is safety, privacy, fairness, ethics, all this stuff, they do take the form of what is effectively a tax on organizations that are competitive. I mean, Google has to compete with Facebook. Facebook has to compete with everybody else. It’s a market.

Jeremie (38:06):

And that’s a good thing, but as long as there’s a tax for fairness, as long as there’s a tax for safety, that means there’s an incentive to have some kind of a race to the bottom where everyone’s trying to invest minimally in these directions to some degree. And at least the way I’ve always seen it, groups like Partnership on AI have an interesting role to play in terms of mediating that interaction, setting minimum standards. I guess, first off, would you agree with that interpretation? And second, how do you see Partnership on AI as a bigger role as the technology evolves? Because I know this is a big thing, but I want to make sure I get your broad thoughts on [crosstalk 00:38:40].

Joaquin (38:39):

No, and I’m glad that you brought up Partnership on AI. I am on the board and have been very involved for the last couple of years, and I think it is an organization that the world needs. But I would love your help with understanding the logic behind the race to the bottom hypothesis. And maybe we can make it very concrete by picking any dimension of responsibility that you want. I’m just struggling a little bit. I want to make sure that I really understand what you mean.

Jeremie (39:14):

Yeah, yeah, definitely. So, I’m imagining … Actually, I’ll take an example from the AI alignment and safety community. So, one common hypothesis, I don’t necessarily buy into this, but for concreteness, let’s assume it’s true, is that language models can be scaled arbitrarily well and lead to something like super powerful models that are effectively Oracles or something like that. So, they’re essentially unbounding the amount of stuff you can do with super large language models.

Jeremie (39:43):

And to the extent that safety is a concern, you might not want an arbitrarily powerful language model that can plot and scheme. Like you can say, “Hey, how do I swindle like $80,000 from Joaquin?” And the language model goes, “Oh, here’s how you do it.” And yet, companies like OpenAI are competing with companies like DeepMind or whoever else to scale these as fast as they can. Whoever achieves a certain level of ability first, I mean, there’s a decisive advantage in even like six months of the time or whatever.

Jeremie (40:16):

There are analogs for this across, you got fairness, and bias and so on, but there’s always some kind of value that’s lost any time you engage in the company to company competition where you could invest the marginal dollar in safety, or you could invest the marginal dollar in capabilities. And that trade-off is something that … I’m wondering if Partnership on AI could play some sort of role in mediating between organizations setting standards that everybody at least thinks about adhering to, that sort of thing.

Joaquin (40:44):

Yeah. Thank you. I understand the question. The question makes me think a little bit about environmental sustainability in a way, where you could have the same race to the bottom in many ways, not only between companies, but even between countries. It’s like we have one planet. Who’s going to commit to cutting CO2 emissions? Like, oh, well, just a second and I’ll be back in a moment.

Jeremie (41:10):

Just after this tree.

Joaquin (41:13):

Exactly. I think that Partnership on AI plays a key role on many fronts. First, being a forum, a very unique forum that brings together every single type of organization you can think of, academia, civil rights organizations, civil society, other kinds of not-for-profits, small companies and big companies. And some of these organizations on Partnership on AI are going to be actively lobbying and shaping and informing upcoming regulation. So, I think there is already a bit of pressure there somehow for big companies to be involved and to lead the way because failing to do so may mean that you have to deal with regulation that is bad for everybody. So, I feel like that’s one angle.

Joaquin (42:16):

The other angle is efficiency. Some of that work that you described, some of that dollar that I could either invest in making sure that my big language models are human value aligned and that they’re not evil, well, you can have an organization that is supported by a large number of for-profit companies do that work. ABOUT ML is a project that I love that builds documentation practices for data sets and for models, so it’s a combination of ideas like datasheets for datasets and model cards and things like that. These are extremely powerful.

Joaquin (43:04):

And we at Facebook are definitely paying a huge amount of attention and experimenting with that. And I’m glad that we can use something, A, relatively off the shelf, but very [crosstalk 00:43:16], that we didn’t have to build ourselves, and, for extra points, that gets the validation and credibility you get from it being built by an organization like Partnership on AI, which doesn’t represent just our interests. And I think that that’s going to be a common pattern.

Joaquin (43:33):

That thing of, A, the 360 view that you get by having a multi-stakeholder organization that has everybody. Second, just the efficiency and convenience of having best practices and recommendations that come from people who really know what they’re doing, and where you know that everybody was looking while this was being created. And then the third one is really the legitimacy and validation that you get by using something that, hey, these are not like the ethical principles of Facebook. Right?

Jeremie (44:06):

Yeah.

Joaquin (44:06):

I mean, I would understand if people frowned and people were like, “Oh well, I mean, what interests are represented there?” Only Facebook’s, versus it coming from The Partnership on AI.

Jeremie (44:17):

Yeah. I mean, it’s one of those organizations where I’m hoping to see a lot more of them in the future. And it seems like the kind of thing that the nascent kind of kernel of that organization is hopefully going to keep blossoming because as you say, we need some kind of multi-partisan oversight of this space at so many different levels. And it’s great to see the initiative taking shape. Joaquin, thanks so much for your time. Actually, do you have a place where you like to share stuff, like Twitter in particular, or a personal website that you’d like to point people to if they want to find out more about your work?

Joaquin (44:49):

Oh, I have a confession. So, I’m a late Twitter adopter.

Jeremie (44:58):

Oh, that’s probably good for your mental health.

Joaquin (45:03):

But I now feel a strong sense of responsibility to be contributing and engaging in the public debate [inaudible 00:45:14] might be. I’m going to embarrass myself so much right now by having to even look up my Twitter. On Twitter, I am @jquinonero. So, it’s letter J and then my first last name. It’s complicated. I have two last names because I’m from Spain.

Jeremie (45:33):

We can share it.

Joaquin (45:34):

We have a … Exactly. Yeah, yeah. I have made a personal commitment to engage more, and I’m going to have to learn how to do this in a way that doesn’t disrupt my mental health like you were saying.

Jeremie (45:48):

Yeah. I think that’s one of the biggest challenges, for some reason, especially I find with Twitter is keeping that mindful state and not getting lost in the feed. Boy. Anyway, we’re in for a hell of a decade. But really appreciate your time on this. Thanks so much, and really enjoyed the conversation.

Joaquin (46:05):

Thank you so much, Jeremie. It’s an absolute pleasure to be here. And thank you for doing this podcast, and thank you for getting everybody involved in these kinds of conversations.

Jeremie (46:17):

My pleasure.

Mapping Black-Owned Businesses with GeoPy and Folium

Mapping Black-Owned Businesses with GeoPy and Folium

A public, interactive map to explore Black-owned businesses in the Greater Boston Area

Image for post

Introduction

The COVID-19 pandemic has taken a serious toll on small businesses across the United States, and Black-owned businesses continue to be hit the hardest. In the past year, many have watched the escalation of ongoing economic and racial justice crises compound with a crisis of public health. Calls to support local and Black-owned businesses can hardly begin to accomplish all that’s needed to put an end to these concurrent crises; such calls can, however, be a small step in the right direction toward narrowing the racial wealth gap and repairing local economies.

The map discussed in this article is intended to be a resource for people interested in exploring Black-owned businesses to support in the Greater Boston area, and for those who wish to create similar maps for other regions. The map uses data from this crowd-sourced spreadsheet of Black-owned stores, restaurants and services in the Boston area. It should be noted that not every entry in the spreadsheet has been confirmed to be a Black-owned business, so the map may include both confirmed and unconfirmed entries.

Cleaning the Data

The data from the above spreadsheet document was read into three separate pandas data frames: ‘restaurant,’ ‘store,’ and ‘service.’ The columns of the data frames were renamed where necessary such that each one contained a ‘Name,’ ‘Address,’ and ‘Website’ column. The final service data frame also contained a ‘Category’ column, which allowed for simple subsetting by service category in the mapping stage.

Duplicate names were removed, so that one marker could be added to the map for each business. Many multi-service businesses had duplicate appearances in the service data frame. Only the first occurrence of each business was kept, which means that one limitation of the map is that its markers may not capture the variety of services offered by some businesses.
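A minimal sketch of this de-duplication step, using made-up business names and services rather than the spreadsheet’s actual entries, might look like this:

```python
import pandas as pd

# Illustrative multi-service data: "Example Co" appears twice because
# it offers two services (names here are invented for demonstration)
service = pd.DataFrame({
    "Name": ["Example Co", "Example Co", "Another LLC"],
    "Service": ["Catering", "Event Planning", "Design"],
})

# Keep only the first occurrence of each business name so that each
# business gets exactly one marker on the map
service = service.drop_duplicates(subset="Name", keep="first")

print(service["Name"].tolist())  # ['Example Co', 'Another LLC']
```

Note that keep="first" is what causes only the first listed service per business to survive, which is the limitation mentioned above.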

In the case of the restaurant and service data frames, address data was originally spread across multiple columns. Components of each address were concatenated into a single column with the Series.str.cat() method, as in this example:
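A minimal sketch of that concatenation, assuming hypothetical component columns like ‘Street,’ ‘City,’ and ‘State’ (the spreadsheet’s actual column names may differ):

```python
import pandas as pd

# Sample data with address components split across columns
# (column names and values are illustrative, not from the spreadsheet)
service = pd.DataFrame({
    "Name": ["Example Shop", "Sample Services"],
    "Street": ["123 Main St", "45 Oak Ave"],
    "City": ["Boston", "Cambridge"],
    "State": ["MA", "MA"],
})

# Concatenate the components into a single 'Address' column,
# comma-separated, using Series.str.cat()
service["Address"] = service["Street"].str.cat(
    [service["City"], service["State"]], sep=", "
)

print(service["Address"].tolist())
# ['123 Main St, Boston, MA', '45 Oak Ave, Cambridge, MA']
```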

Obtaining the Geographic Coordinates

Once a full address column was ready for each data frame, geographic coordinates were obtained with GeoPy. The GeoPy client allows developers to retrieve coordinates given an address, and vice versa. The below function obtains coordinates for an entire column of addresses by iterating through the column and appending the latitude and longitude to ‘lat’ and ‘lon’ lists. The client returns NoneType objects for addresses it cannot resolve. To account for these cases, the if/else statement in the function extracts coordinates only from geopy.location.Location objects; for NoneType objects, “NA” is appended to the lists.
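A sketch of such a function, named get_lat_lon() to match the name used later in this article. The Nominatim client and its user_agent string are assumptions, and the optional geocode parameter is an addition so the function can be exercised without network access:

```python
def get_lat_lon(addresses, geocode=None):
    """Return parallel 'lat' and 'lon' lists for an iterable of
    address strings. Addresses the client cannot resolve yield
    "NA" in both lists."""
    if geocode is None:
        # Lazily build the GeoPy client (client choice and
        # user_agent string are assumptions, not from the article)
        from geopy.geocoders import Nominatim
        geocode = Nominatim(user_agent="black-owned-business-map").geocode

    lat, lon = [], []
    for address in addresses:
        location = geocode(address)
        # The client returns a geopy.location.Location on success
        # and None for addresses it cannot resolve
        if location is not None:
            lat.append(location.latitude)
            lon.append(location.longitude)
        else:
            lat.append("NA")
            lon.append("NA")
    return lat, lon
```

Appending “NA” rather than dropping the row keeps the coordinate lists aligned with the data frame, so failed lookups can be filtered out later.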

Adding Markers to the Map

The above function was used as a helper function in add_markers(), which adds markers to a map in Folium. The location parameter in folium.Map() takes a set of geographic coordinates, which marks the default center location of the map. The tiles parameter sets the style of the map’s background, and the zoom_start parameter sets the map’s initial zoom level, where higher numbers create a closer zoom.

The add_markers() function takes a data frame with an ‘Address’ column, a marker color, and a marker icon as arguments. It retrieves the coordinate data for the addresses with get_lat_lon() and filters out rows with missing coordinate data. The points to be added to the map, and their corresponding business information, are stored as tuples via zip(). Each tuple, aliased as ‘p’ in the for loop, contains the following data:

  • p[0] and p[1]: latitude and longitude, respectively
  • p[2]: Business name
  • p[3]: Business address
  • p[4]: Business website

The final element in the zip object for each business, website, may contain null data if a website was not recorded in the spreadsheet. The if/else statement in the function adds only the business name and address to the marker’s popup text in these cases; if a website is available, it will also be included in the popup text.

In this function, the icon argument takes the same size dimensions for all markers. The markers’ color and icon will vary based on what gets passed into the add_markers() function. Icons used in the map were accessed via Font Awesome, which is selected by setting the prefix keyword argument to ‘fa’ in the icon argument. The below table shows the different types of markers that have already been added to the map and the business categories they correspond to:

Image for post

The add_markers() function was used on the restaurant and store data frames, and on subsets of the service data frame that were broken up by category. In the example below, rows that correspond to architectural services went into the ‘architects’ data frame. The function was then called on the data frame, and the markers were given a cadet blue ‘building’ icon:
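That subsetting step might look like the following sketch; the ‘Architecture’ category label and the sample rows are made up for illustration:

```python
import pandas as pd

# Illustrative service data; the 'Architecture' label is an assumed
# category value, not necessarily the spreadsheet's exact wording
service = pd.DataFrame({
    "Name": ["Firm A", "Salon B", "Studio C"],
    "Address": ["1 Main St", "2 Elm St", "3 Oak St"],
    "Website": ["http://a.example.com", None, "http://c.example.com"],
    "Category": ["Architecture", "Beauty", "Architecture"],
})

# Subset the rows corresponding to architectural services
architects = service[service["Category"] == "Architecture"]

print(architects["Name"].tolist())  # ['Firm A', 'Studio C']
```

The resulting architects data frame would then be passed to add_markers() with a cadet blue color and the Font Awesome ‘building’ icon.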

Exploring the Map

Here is a preview of what the map currently looks like and how it can be navigated:

Image for post

The interactive map is hosted here. Feel free to explore it by zooming into a neighborhood of your choice and clicking on business markers to view their popup information.

Conclusion

This article reviewed my process for creating an interactive map of Black-owned businesses in the Boston area. To see the full code for this project, visit its GitHub repository.

You should also follow Backing Black Business, whose contributors plan to display Black-owned businesses across the United States on an interactive map. There is an option on the current website for Black business owners to submit information about their business for inclusion on the map once it’s launched.

Supporting Black-owned businesses goes beyond creating lists and maps for easy discovery. Resources like these only have real value if they are actually used to find Black-owned businesses to support with your dollar—both in these especially difficult economic times and beyond.

8 Crucial Mistakes Holding You Back From Getting a Programming Job

8 Crucial Mistakes Holding You Back From Getting a Programming Job

Number 6. You’re pretending to be more talented than you are

Photo Credit to Andrea Piacquadio via Pexels

If you’re reading this article, you’re well aware of the great benefits that come with a programming job — high salaries for programmers, an expanding job market, exciting opportunities.

You’re also aware that employers are increasingly desperate for seasoned, qualified, talented programmers. The DAXX blog writes that in 2020, while there were 1.4 million unfilled programming jobs, there were only 400,000 computer science graduates. Of course, hobby coders or bootcamp programmers might fill some of those ranks, but overall, the demand for these jobs far outweighs the supply of programmers.

That’s what makes it all the more frustrating when you count yourself as one of those skilled programmers, but you still can’t get a programming job.

Nowadays, so much online material that’s designed to help you learn coding or programming emphasizes getting you a job — fast. But while these shortcuts can help you get started on the journey of getting a programming job, they’re also part of what’s stopping your career progression in its tracks.

If you believe you have the necessary skills but you still can’t get a programming job, you might be falling into one of these eight missteps. The good news is they’re all fixable. At the end of the day, the only prerequisite to getting a coding job is the desire to get a coding job. If you’ve got that, everything else is within your grasp.

Table of Contents:

1. You haven’t mastered the fundamentals of computer science.
Why this means you can’t get a programming job
How to solve this problem
2. You’re not presenting yourself in a way that demonstrates you’re a good culture fit
Why this means you can’t get a programming job
How to solve this problem
3. You’re ignoring good interview skills
Why this means you can’t get a programming job
How to solve this problem
4. You don’t have experience
Why this means you can’t get a programming job
How to solve this problem
5. You’re trying to master everything
Why this means you can’t get a programming job
How to solve this problem
6. You’re pretending to be more talented than you are
Why this means you can’t get a programming job
How to solve this problem
7. You haven’t demonstrated you want to learn
Why this means you can’t get a job
How to solve this problem
8. You’re ignoring automated filters
Why this means you can’t get a programming job
How to solve this problem
Final thoughts on why you can’t get a programming job

1. You haven’t mastered the fundamentals of computer science.

Many people who learn coding at a bootcamp, instead of focusing on learning computer science online, skip steps and don’t take the time to do it properly.

These bootcamps are designed to get you a very specific set of skills in a certain range of time. What they’re not designed to do is teach you the underpinnings of computer science — algorithms, computer architecture and hardware, data structures, databases, and computational theory to name a few.

I don’t mean to imply these courses are trying to trick you. Their purpose is to teach the bare-minimum of coding skills required for most entry-level positions. It just so happens that purpose doesn’t align with the goal of getting you a programming job, which is more complex.

As more educational resources crop up on the internet, the possibility of getting a coding job without a traditional degree becomes more and more likely. This means that more computer science beginners are taking an eight-week boot camp in Python, which is easy for beginners to learn, and getting frustrated when they can’t get a programming job immediately after.

Why this means you can’t get a programming job

Let’s say you truthfully state on your resume that you can write Python code. This may get you in the door for an interview at an exciting startup — your dream. In the interview, they’ll ask you a basic question about algorithms and you’ll be completely stumped.

While these job applications may not outright say “needs to understand the basics of data structures,” this is because it’s implicit. The traditional route of learning computer science teaches you basics before even getting into any languages, which helps you interpret and apply the skills you’ll learn later more effectively.

The people hiring you aren’t just looking for a Python savant. They want someone who can do the job in its entirety. They will be able to tell in an instant whether you have the basic understanding that’s necessary to do the job, or whether you’ve simply memorized Python code. This is a common misstep that may mean you can’t get a programming job.

How to solve this problem

If you’ve already spent time and money on your bootcamp or course, you don’t have to go back to school to make the most of your investment. Instead, compile a list of the basics and study them. This will help you leverage your existing knowledge in whatever language or course you took, without getting tripped up on the essentials.

There are many resources online, both paid and free, that can help you learn the building blocks. Whether you get the basics from a degree, course, or bootcamp, you can’t expect to do a single course focusing on one niche aspect and get a job.

2. You’re not presenting yourself in a way that demonstrates you’re a good culture fit

Personally, part of the reason I ended up going for a data science job out of college was because I hated sales, like many other people. But even if you hate sales, when you’re applying for jobs, you have to know how to sell yourself at every stage of the application process. If you can’t get a programming job, it may be because you’re neglecting this part of the application process.

The stereotype is that companies that want to hire you for a coding job only care about your technical chops, but it isn’t true. Along with the languages you can code in, the computer science fundamentals you understand, the interview questions you ace, they’re also going to be looking at a crucial aspect: culture fit.

Even though the job market is wide open, employers want someone who is not only good at coding but will also make a good addition to the team and the company as a whole.

Why this means you can’t get a programming job

Programmers have a stereotype of being the type of obsessive individuals who lock themselves into a cold, dark basement until their coding project is done. Charlotte Bone, a full-stack web developer, wrote in her blog post on the subject that the idea that “programmers love nothing more than to be sat in a dark room coding all day and night,” is a harmful stereotype.

It might be an obstacle because, though you want to be a programmer, you might not fit that stereotype but still feel you have to play the part for an employer. Or, if you do fit it, you might show that side of yourself off but forget that employers aren’t looking for coding machines — they’re looking for employees. If you can’t get a programming job, consider how you’re presenting your personality and how it fits with your target company.

How to solve this problem

The truth is that most people, including programmers and coders, aren’t coding machines — we’re real people with interests and hobbies outside of writing lines of code. The important thing is to allow that to surface in your resume, and during your interview, in a way that makes it clear you’re a great culture fit as well as technically competent. If you can’t get a programming job, it might be because your resume only looks at programming skills.

Do you show passion, engagement and curiosity? What extracurriculars have you done? How can you package your skills to demonstrate you’re what they need both from a technical and cultural perspective?

Take time to research your potential future employer and the non-technical qualities current employees show. This will give you the strongest chance of presenting yourself favorably from both a skills and a culture-fit perspective.

3. You’re ignoring good interview skills

In a similar vein, if you can’t get a coding job, it might be that you’re ignoring common and vital interview skills.

Even if your resume is perfect, even if you’re nailing the questions at the interview, it’s still critical to remember you’re going to be judged as a human, not a machine. Maintaining eye contact, expressing self-confidence, and soft skills like forming a connection with your interviewer all still matter.

Why this means you can’t get a programming job

Some people who can’t get a programming job are fed up because on paper, they have every single skill necessary to get the job done. But interviews are also a test of how you work with other people and how well you can communicate, which are often overlooked.

Humans aren’t great at telling at a glance if you’ll be the kind of person who works well with others, but we’ve come up with some shorthand that gives us an indication. That shorthand is common interview etiquette.

Laurie Voss, CTO of npm, wrote in Quartz that, “the job of an engineer is to work with a team to achieve something larger, and if you are unwilling or unable to spend time communicating with your colleagues you’re only doing half of your job.” If you’re not showing off good people skills in your interview, your would-be employers might think you fall into that category of coders who can only do half the job.

How to solve this problem

If you’re getting to the interview stage and still can’t get a programming job, it might be a sign that this is the issue affecting your job prospects. To solve it, there’s a simple checklist to follow in every interview, virtual or in-person:

  • Make eye contact
  • Ask your interviewer about something unrelated to the job — their family, their plans for the afternoon, their pets.
  • Exude confidence. Remember you applied for this job because you think you’re the best candidate for it!
  • At the end of the interview, make a passing reference to what you spoke about at the start.

This shows interviewers that you’re not just a superb coder but also a strong interviewee, and hence a good communicator.

4. You don’t have experience

Don’t blame yourself too much if you can’t get a programming job for this reason. This is a problem compounded by the job listings posted by employers. It seems like every employer needs coders who have at least five years of experience in a language that only came into creation one year ago. This leads to coders applying for jobs that might be considered a bit of a stretch in terms of their applicable experience.

Because standards are high, a little bit of exaggeration can be forgiven. But the problem arises when you’re applying to jobs that ask for experience, and you don’t prove where your experience comes from. This is a common reason even skilled coders can’t get a programming job.

The issue is that years of experience don’t mean anything. I could be the laziest coder in the world, with an alleged five years’ experience writing Perl, because my sister owns the company and she didn’t fire me. I’d have the exact same skill set as someone with just one month of experience, but who had taken her job much more seriously than me.

As with interview etiquette, asking about years of experience is just job-application shorthand for “Do you know how to do 75% of the things we’ll need you to do?”

Why this means you can’t get a programming job

Imagine someone looking through stacks of resumes where they asked for coders with five years’ experience in Python. They see you, like everyone else in the stack, have listed that you have the requisite five years.

From the outside, it looks like you’ve answered their question. But what they’re really asking is whether you have the skills and experience needed to solve their Python problems. Whether you have one year or five years of Python experience is immaterial at this point — you need to show that you have the equivalent of those years of experience. This not only answers the employer’s actual question, but also helps you stand out among the other applicants.

How to solve this problem

No, you don’t need five years’ experience. But if you can’t get a programming job, you need to demonstrate that you care about your future job, especially if you have no prior work experience. What projects have you done for fun? What did you most enjoy about them? What problems did you solve?

Do you have a blog, GitHub repo, or another portfolio where you can demonstrate your commitment to programming? Nathanaël Cherrier, Lead JavaScript software engineer at Ferpection, lists some advantages in his blog post on why developers should start blogs: “When you write on the Internet you become more visible than a regular developer. Who are you hoping will read your post? Future colleagues? A recruiter from that awesome company you’d like to work at? The committee responsible for choosing the speakers of a conference you’d like to talk at? All these people will be interested in both your technical skills and your editorial skills.”

If you can’t get a programming job because of a lack of experience, sharing your passion can be a great way to prove to employers that you can do what they need you to do.

5. You’re trying to master everything

Again, like the unrealistic requirements in years of experience, many companies contribute to this issue by listing just about every potential language and technical skill that might one day come in useful on their job listings. If you can’t get a programming job, it might be because you are trying to master everything you see on job listings.

The issue is that there are countless programming languages and skills that a newbie coder might think they have to learn.

Instead, along with learning the basics that underpin every core computer science job, you need to market one sellable aspect of yourself instead of trying to do everything.

Why this means you can’t get a programming job

Unless you’re a veteran programmer, you can’t master everything that jobs will list under their requirements. (And if you’re a veteran programmer you probably have a decent job already!)

If you’re at the beginning stage of researching jobs and worried you can’t get a programming job, you might believe you have to master everything they ask for. Instead of applying yourself to one or two key skills, you spread yourself thin to have a passable knowledge of anything that might come up in an interview. But instead of becoming a master of eight languages, you’ll just end up knowing a little of each.

How to solve this problem

I loved the way Teresa Dietrich described her solution in her blog post entitled “What I learned from hiring hundreds of engineers can help you land your next role.” In it, she writes that plenty of job listings have exaggerated requirements that seem to cover everything under the sun.

If you can’t get a programming job, her solution is to make a spreadsheet of the jobs that interest you, and the core skills that each one requires. Chances are you’ll spot some commonalities pretty quickly. This will give you your answer on the skills most likely to help you get that job, even if they list twenty other “requirements.”
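Fittingly for a programmer, her spreadsheet idea can be prototyped in a few lines of code. In this sketch the listings and skill names are invented for illustration; a tally like this surfaces the skills that recur across postings:

```python
from collections import Counter

# hypothetical "core skills" columns from three job postings
listings = [
    {"python", "sql", "git", "docker", "kubernetes"},
    {"python", "sql", "aws", "spark"},
    {"python", "sql", "git", "linux"},
]

# count how many postings mention each skill
skill_counts = Counter()
for listing in listings:
    skill_counts.update(listing)

for skill, count in skill_counts.most_common():
    print(f"{skill}: {count}/{len(listings)} postings")
```

Here python and sql appear in all three postings and git in two, so those are the skills worth prioritizing first.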

Part of the reason you can’t get a programming job might be that when you are asked to do everything, you can only prove mastery of a very few things. By cutting through the noise and delivering the signal these companies are looking for, you can get the programming job of your dreams.

What language do you enjoy the most? Which one do you understand best? These answers will help begin to point you in the right direction — both in terms of what sorts of jobs meet your existing skill set, as well as how to market those skills on your resume and during your interview.

6. You’re pretending to be more talented than you are

Of course, people who can’t get a programming job might be desperate enough to stretch the truth in order to try to cover every point. With job listings as unrealistic as they can sometimes be, this is only to be expected. But it might actually be stopping you from getting a programming job.

The good news is that while employers do want a capable employee, this doesn’t mean they need you to do everything they list in the job requirements.

Why this means you can’t get a programming job

When you tailor your resume and cover letter, you try to meet those unrealistic requirements by stretching the truth to appear as though you’re a master in everything they ask for. Interviewers will be able to see through that on your resume, and even if you get the interview, they’ll definitely spot it then.

In the clearest terms, you won’t be able to fool experienced programmers and interviewers. While the job listings may seem unattainable, this doesn’t mean you should pretend to have more experience, knowledge, or skills than you actually have. The best-case scenario is you get hired for a stressful job you can’t do properly. The worst-case scenario is you can’t get a programming job and waste your time applying to a job you don’t qualify for.

How to solve this problem

Keep your future employers’ goals in mind. As with years of experience, they don’t need you to tick every single box. They just want to hire the person who’s best able to get the job done.

By sticking to what you do know, both on the job application and in the interview, you’ll be able to play to your strengths. Be honest about your skills as well as your limitations. So long as you can demonstrate that you can do what they’ll need you to do, you’ll be in with a chance.

Consider an employer who meets two candidates: one who says they can do something that they can’t, and one who says it’s beyond their current skill set, but demonstrates how they’ve grown their skills over the past year. The latter is much more appealing to employers. If you can’t get a programming job, consider how to pare back your alleged skills to keep it as real as possible.

7. You haven’t demonstrated you want to learn

It’s interesting to note that LinkedIn Workplace Learning’s 2020 report shows that the most in-demand skills aren’t technical ones at all, but rather soft skills. The reason, they postulate, is because technical skills age quickly. A vital skill one year is redundant the next. Soft skills, like the basics of computer science, underpin every single other skill that might be attractive to employers.

That means a learning aptitude is more important than any other technical skills you can show. Most employers want to hire a candidate they’ll only have to minimally train because you can bet new technical competencies will be necessary year after year. To be an ideal candidate, once you’ve demonstrated your core skills and your main talents, prove that you’re still interested in learning. Programming is not a static career. New techniques, languages, and skills come out all the time. You need to be dynamic to stay on top.

Why this means you can’t get a job

Let’s assume you are a perfect candidate who has every year of experience, every language they ask for, and who can demonstrate a solid grasp of the fundamentals of computer science. If you still can’t get a programming job, even with everything in your favor, it may be because you haven’t demonstrated you want to learn.

If your resume doesn’t prove that you are still interested in learning new skills, and if you don’t show a passion for gaining knowledge at the interview stage, even if you’re the perfect candidate today, tomorrow you’ll be obsolete.

People who want a career in coding or programming might focus on the hard, technical skills because they’re easier to prove. But if showing a desire to learn is only an afterthought, this might be a reason you can’t get a programming job.

How to solve this problem

Luckily, most coders and programmers love to learn. You have to, especially if you didn’t get a traditional computer science degree. This is where a more unconventional background can come in handy: having taken courses or earned certificates is a great way to demonstrate your dedication to learning.

It’s also a good idea to brush up on the latest programming trends. You don’t have to show total mastery of them — and indeed, that would be a waste of your time — but by demonstrating interest in the trends of programming, you can show you enjoy learning and staying current in the computer science field.

Finally, don’t limit yourself to only showing a passion for learning computer science. What else do you enjoy learning? Instruments, spoken languages, watercolor techniques and more can all be ways to showcase your love of learning.

8. You’re ignoring automated filters

If nothing on this list of mistakes applies to you and you still can’t get a programming job, it might be because technology is working against you. Even though most people applying for programming or coding jobs are deeply technical, it’s easy to overlook the fact that the recruitment process is automated. Take Amazon, which got into trouble back in 2018 when it turned out that its recruitment AI was mistakenly showing bias against women. Many qualified individuals didn’t even have their resumes looked at due to faulty automation.

Setting problematic AI aside, consider the fact that many times, your application doesn’t even get seen by HR or recruiters because they’ve applied filters to minimize their workload to a reasonable number of candidates. By using keywords to shrink the list, they do their best to winnow out the least applicable resumes before they even get looked at by a human.

Why this means you can’t get a programming job

Many talented programmers are not good at optimizing their resume for keywords. That is a skill in and of itself. While it may make HR’s job easier, it does mean that the reason you can’t get a programming job isn’t due to any deficiency of yours, but because you just didn’t put the right words in the right order on your resume.

Unfortunately, if you’re struggling to get a programming job, you can’t afford to ignore these automated filters and assume your natural talent will shine through. Like it or not, you might have to play the game a bit to get your ideal programming job.

How to solve this problem

If you suspect this is the reason you can’t get a programming job, there are two methods to ensure you’re not getting overlooked by a machine.

First, and most obvious, is to optimize your resume for keywords. Take another look at the job listing and make sure every requirement it mentions appears in your resume, in the company’s own words. The Balance Careers’ blog post on resume keywords also recommends ensuring your resume reflects the company’s brand (what sets it apart), so try checking its LinkedIn page, as well as the LinkedIn profiles of current employees.
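As a rough illustration of that keyword check, here is a sketch that lists which words from a posting never appear in a resume. The tokenizer, stop list, and sample texts below are all my own simplifications; real applicant-tracking filters are far more sophisticated:

```python
import re

def keyword_gaps(job_listing, resume):
    """Which words from the listing never appear in the resume?"""
    def tokenize(text):
        # crude word extraction; keeps + and # for names like c++ and c#
        return set(re.findall(r"[a-z+#]+", text.lower()))
    # a tiny, made-up stop list of words not worth matching
    stop = {"a", "an", "the", "and", "or", "with", "in", "of", "to", "for"}
    return tokenize(job_listing) - tokenize(resume) - stop

listing = "Senior Python developer with SQL and Docker experience"
resume = "Built Python data pipelines; strong SQL background"
print(sorted(keyword_gaps(listing, resume)))
# ['developer', 'docker', 'experience', 'senior']
```

Running it against a real posting would at least tell you which of the company’s own words your resume is missing.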

The second method is less intuitive: remember that recruiters are human, too. You can throw your resume on a stack and hope it gets looked at, or you can go the extra mile and find a current employee with a similar job title, a hiring manager, or whoever your boss might be, and send them a short LinkedIn message. You can express enthusiasm for the role, ask questions about current duties, or even just let them know you’ve just applied and are looking forward to hearing back.

It’s a way to put yourself on their radar, and make sure your name is already familiar if your resume does come up. Never forget that a human touch goes a long way.

Final thoughts on why you can’t get a programming job

It’s possible to get rejected at two potential stages: getting an interview and getting a job offer after the interview. To maximize your chances if you can’t get a programming job, see where you’re struggling and apply the relevant tips there.

Overall, the advice boils down to this: remember human skills are still relevant. Don’t lie. Stick to your strengths. And cover the basics first.

If you can do that, you’ll have solved the primary reasons you can’t get a programming job.

Keep working at it, and remember: employers are as desperate for good programmers as you are for a good job. All you have to do is show them that you’re the candidate they’ve been dreaming of.

Originally published at https://qvault.io on January 18, 2021.

Largest Rectangle in a Matrix

How to combine programming techniques

As I accumulate more experience in coding and in life generally, one observation stands out to me: whenever there is a problem to be solved, there usually exists a very intuitive solution to it. Sometimes this solution happens to be efficient, sometimes grossly suboptimal. This is especially noticeable in algorithms, because an algorithm is often a conceptualization of a real-life problem, one that keeps the problem’s essential elements and lets us tackle the problem directly.

The largest rectangle in a matrix is one such problem, in my opinion. Here is a brief description of the problem.

Problem

I have a board of red and blue dots and would like to find the largest rectangle formed by the blue dots:

(image by author)

In this example, it’s easy to tell that the largest rectangle formed by the blue dots has a size of 8.

But let’s figure out a way to do this algorithmically.

Solution #1

How to do this simply

The first time I tried out this problem, I thought to myself:

Well, it looks like I probably need to iterate through every point in the matrix, and at each point, I need to find the largest rectangle containing that point. I’ll compare that rectangle to the size of the rectangles I have already found. If the new one is bigger, I’ll keep it, otherwise I move on to the next point.

This sounds great, but there is a problem with it: how do I find the largest rectangle containing a particular point? Nothing comes to mind that would help me solve this problem.

What if I just want to find the largest rectangle that has the current point as its top left corner? I think this is a more manageable problem.

To figure that out, I’ll loop through each point to the right of my current point. At each point I find the maximum height of blue dots; if it’s smaller than the height at the current point, I update the current height to the new height. Then I find the size of the new rectangle, and if it’s a bigger rectangle I update the max size.

Let’s apply this procedure to our example.

Suppose I have looped to point (1, 0):

(Image by Author)

I find the height of my current point, which is 2. Given this information alone, the largest rectangle with (1, 0) as its top left corner is also 2.

(1, 0): height = 2, max_rectangle = 2.

I iterate through every point to the right:

Point (2, 0) has a height of 3, but that’s larger than the starting point’s height, so the effective height is still 2. But now we know that we can have a rectangle of size 2 * 2 = 4:

(2, 0): height = 2, max_rectangle = 4

Point (3, 0) has a height of 1; it’s smaller than the current height, so we update the current height to 1. The rectangle that can be created is 1 * 3 = 3, but the current max is 4, so we ignore it:

(3, 0): height = 1, max_rectangle = 4

Going through the same process for the rest of the points:

(4, 0): height = 1, max_rectangle = 4

(5, 0): height = 1, max_rectangle = 5

we find the largest rectangle with top left corner (1, 0) to be of size 5.

Let’s put this into code:

def find_max001(matrix):
    width = len(matrix[0])
    height = len(matrix)

    # max width and max height at each point
    # uses memoization and dynamic programming
    max_matrix = [[None for v in row] for row in matrix]

    def get_max(i, j):
        if i >= width:
            return 0, 0
        elif j >= height:
            return 0, 0
        elif max_matrix[j][i] is not None:
            return max_matrix[j][i]
        elif matrix[j][i] == 0:
            max_matrix[j][i] = (0, 0)
            return max_matrix[j][i]

        max_down = get_max(i, j + 1)
        max_right = get_max(i + 1, j)

        max_matrix[j][i] = (max_right[0] + 1,
                            max_down[1] + 1)
        return max_matrix[j][i]

    max_rect = 0
    for i in range(width):
        for j in range(height):
            rect = get_max(i, j)
            cur_max = rect[1]
            max_rect = max(max_rect, cur_max)  # width-1 rectangle
            for k in range(1, rect[0]):
                # shrink the height to the shortest column seen so far,
                # and score the rectangle at each candidate width
                cur_max = min(cur_max, get_max(i + k, j)[1])
                max_rect = max(max_rect, cur_max * (k + 1))

    return max_rect
def problem003(solver):
    m001 = [
        [1, 1, 1, 1, 1, 1],
        [0, 1, 1, 0, 1, 1],
        [0, 0, 1, 0, 1, 1],
        [0, 0, 0, 0, 1, 1],
        [0, 0, 0, 0, 0, 0],
    ]

    res1 = solver(m001)
    print(f'res1: {res1}')


def test003():
    solver = find_max001
    problem003(solver)


test003()
# res1: 8

Performance of Solution #1

What is the complexity of the algorithm?

Since we are looping through each point, the complexity is at least w * h. Luckily we can use memoization and dynamic programming to find the height at each point, so that lookup doesn’t add to the complexity.

At each point, however, we loop through the width of the matrix to find the largest rectangle starting at that point, which brings the complexity to w * h * w.

So the complexity is: O(w² * h)

Since we also store the width and height at each point, memory usage is: O(w*h).

Solution #2

How to improve the algorithm?

Is there a way to reduce the w² term in complexity?

It turns out, there is.

Let’s go back to our example:

(Image by Author)

Suppose now we are looping through the first row.

From (0, 0), (1, 0), (2, 0) the height keeps increasing: 1, 2, 3. We keep a list:

[(0, 0: 1), (1, 0: 2), (2, 0: 3)]

At position (3, 0), height drops to 1. What does this new information tell us?

Quite a bit, actually. Given this information, we can say for certain that a rectangle with top left corner at position (2, 0) with height 3 can only have a maximum size of 3.

We can also say for certain that a rectangle with top left corner at position (1, 0) with height 2 can only have a maximum size of 4.

After processing the two points, we can remove them permanently, but we need to add a new point at (1, 0) (note: not at (3, 0)) with height 1:

[(0, 0: 1), (1, 0: 1)]

Essentially what we are doing is trimming the heights of (2, 0) and (1, 0) to equal the height of (3, 0).

Moving on this way till the end of the row, we do not encounter any more drops in height:

[(0, 0: 1), (1, 0: 1), (4, 0: 4), (5, 0: 4)]

We can then process the points that remain on the list:

(5, 0) max_rectangle = 4

(4, 0) max_rectangle = 4 * 2 = 8

(1, 0) max_rectangle = 1 * 5 = 5

(0, 0) max_rectangle = 1 * 6 = 6

So the max_rectangle after processing the first row is 8, and we have only looped over 6 points! This means the complexity of the algorithm is now w*h.

Below is the algorithm in code:

from collections import deque

def find_max002(matrix):
    width = len(matrix[0])
    height = len(matrix)

    # max depths
    max_matrix = [[None for v in row] for row in matrix]

    def get_max(i, j):
        if i >= width:
            return 0, 0
        elif j >= height:
            return 0, 0
        elif max_matrix[j][i] is not None:
            return max_matrix[j][i]
        elif matrix[j][i] == 0:
            max_matrix[j][i] = (0, 0)
            return max_matrix[j][i]

        max_down = get_max(i, j + 1)
        max_right = get_max(i + 1, j)

        max_matrix[j][i] = (max_right[0] + 1, max_down[1] + 1)
        return max_matrix[j][i]

    def get_rect(stack, j):
        cur_idx = stack.pop()
        cur_max = cur_idx[1] * (j - cur_idx[0])
        print(f"cur_max at {cur_idx[0]}: {cur_max}")
        return cur_max

    max_rect = 0
    for i in range(width):

        # implement the algorithm with a stack
        stack = deque()
        stack.append((-1, 0))
        for j in range(height):
            rect = get_max(i, j)
            cur_width = rect[0]
            cur_idx = j
            while stack[-1][1] > cur_width:
                cur_idx = stack[-1][0]
                max_rect = max(max_rect,
                               get_rect(stack, j))
            stack.append((cur_idx, cur_width))

        while len(stack) > 1:
            max_rect = max(max_rect, get_rect(stack, height))

    return max_rect
def test004():
    solver = find_max002
    problem003(solver)


test004()
# res1: 8

Notice that in order to implement the removal of points (2, 0) and (1, 0) once we reach (3, 0), we use a stack data structure, which lets us push and pop points efficiently.
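The per-row trimming described above is the classic stack solution to the “largest rectangle in a histogram” subproblem. As a standalone sketch (the function name and test values here are mine, not part of the article’s code):

```python
from collections import deque

def largest_rectangle_in_histogram(heights):
    """Largest rectangle under a histogram, via the push/pop stack idea."""
    stack = deque()
    stack.append((-1, 0))  # sentinel: index -1, height 0
    max_rect = 0
    for idx, h in enumerate(heights):
        start = idx
        # every bar taller than the current one can extend no
        # further right, so pop it and score its rectangle
        while stack[-1][1] > h:
            start, bar_h = stack.pop()
            max_rect = max(max_rect, bar_h * (idx - start))
        # the new bar inherits the leftmost popped index
        stack.append((start, h))
    # flush the bars that survive to the end of the row
    n = len(heights)
    while len(stack) > 1:
        start, bar_h = stack.pop()
        max_rect = max(max_rect, bar_h * (n - start))
    return max_rect

print(largest_rectangle_in_histogram([1, 2, 3, 1, 4, 4]))  # 8
```

find_max002 above applies exactly this pattern down each column, using the memoized widths as the bar heights.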

Complexity of Solution #2

As mentioned earlier, we have improved upon solution #1 so that at each point, we don’t have to loop through the rest of the points to its right anymore. The complexity improves to: O(w*h)!

The memory usage remains the same: O(w*h).

What to get out of this?

Besides the algorithm and coding techniques

An algorithm can have many variations, and the first one you come up with is usually not the most efficient. An understanding of complexity analysis and programming techniques can make a big difference. This realization extends not only to programming but to almost everything in life, because everything in life is an algorithm.

Data Science Learning Roadmap for 2021

Building your own learning track to master the art of applying data science

Although nothing really changes except for the date, a new year fills everyone with the hope of starting things afresh. Adding a bit of planning, well-envisioned goals and a learning roadmap makes for a great recipe for a year full of growth.

This post intends to strengthen your plan by providing you with a learning framework, resources, and project ideas to build a solid portfolio of work showcasing expertise in data science.

Disclaimer:
The roadmap defined here is based on my own modest experience in data science. It is not the be-all and end-all learning plan, and it may change to better suit any specific domain/field of study. Also, it is created with python in mind, as I personally prefer to use python.

What is a learning roadmap?

In my humble opinion, a learning roadmap is an extension of a curriculum: it charts out a multi-level skills map, with details on which skills you want to hone, how you will measure the outcome at each level, and techniques to further master each skill.

My roadmap assigns weights to each level based on the complexity and commonality of application in the real world. I have also added an estimated time for a beginner to complete each level with exercises/projects.

Here is a pyramid that depicts the high-level skills in order of their complexity and application in the industry.

(image: pyramid of high-level data science skills)

This pyramid marks the base of our framework. We’ll now deep dive into each of these strata to complete the framework with more specific, measurable details.

Specificity comes from listing the critical topics in each stratum, along with resources to refer to in order to master those topics.

We’d be able to measure it by applying the learned topics to a number of real-world projects. I’ve added a few project ideas, portals, and platforms that you can use to measure your proficiency.

Important note: take it one day at a time, one video/blog/chapter a day. This is a wide spectrum to cover. Don’t overwhelm yourself!

Let’s deep dive into each of these strata, starting from the bottom.

1. Programming or Software Engineering

(Estimated time: 2-3 months)

Firstly, make sure you have sound programming skills. Every data science job description would ask for programming expertise in at least one of the languages.

Specific topics include:

  • Common data structures (data types, lists, dictionaries, sets, tuples), writing functions, logic, control flow, searching and sorting algorithms, object-oriented programming, and working with external libraries.
  • SQL scripting: Querying databases using joins, aggregations, and subqueries
  • Comfortable with using the Terminal, version control in Git, and using GitHub
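To give a feel for the first bullet, here is a tiny sketch exercising the core built-in containers and a function (the names and numbers are made up for illustration):

```python
# the four workhorse containers
scores = [72, 88, 95, 61]            # list: ordered, mutable
by_name = {"ana": 88, "raj": 95}     # dict: key -> value lookups
seen = {"ana", "raj"}                # set: fast membership tests
point = (3, 4)                       # tuple: fixed-size, immutable

def top_scorer(score_map):
    """Return the (name, score) pair with the highest score."""
    return max(score_map.items(), key=lambda item: item[1])

print(sorted(scores))                 # [61, 72, 88, 95]
print(top_scorer(by_name))            # ('raj', 95)
print("ana" in seen)                  # True
print(point[0] ** 2 + point[1] ** 2)  # 25
```

If every line here reads naturally to you, you’re ready to move past the basics.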

Resources for python:

  • learnpython.org [free] — a free resource for beginners. It covers all the basic programming topics from scratch. You get an interactive shell to practice those topics side-by-side.
  • Kaggle [free] — a free and interactive guide to learning python. It is a short tutorial covering all the important topics for data science.
  • Python Course by freecodecamp on YouTube [free] — a 5-hour course that you can follow to practice the basic concepts.
  • Intermediate python [free] — another free course by Patrick featured on freecodecamp.org.
  • Coursera Python for Everybody Specialization [fee] — a specialization encompassing beginner-level concepts, python data structures, data collection from the web, and using databases with python.

Git

  • Guide for Git and GitHub [free]: complete these tutorials and labs to develop a firm grip on version control. It will help you further in contributing to open-source projects.

SQL

Measure your expertise by solving a lot of problems and building at least 2 projects:

  • Solve a lot of problems here: HackerRank (beginner-friendly) and LeetCode (easy or medium-level questions)
  • Data extraction from a website/API endpoints — try to write python scripts for extracting data from webpages that allow scraping, such as soundcloud.com. Store the extracted data in a CSV file or a SQL database.
  • Games like rock-paper-scissor, spin a yarn, hangman, dice rolling simulator, tic-tac-toe, etc.
  • Simple web apps like youtube video downloader, website blocker, music player, plagiarism checker, etc.

Deploy these projects on GitHub pages or simply host the code on GitHub so that you learn to use Git.
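The data-extraction project can be prototyped with the standard library alone. In this sketch the API response is faked as a JSON string so it runs offline; a real project would fetch it with urllib or the requests library, and the field names here are invented:

```python
import csv
import io
import json

# stand-in for an API response body; a real script would download
# this, and the fields are invented for illustration
payload = '[{"title": "Track A", "plays": 120}, {"title": "Track B", "plays": 45}]'

records = json.loads(payload)  # parse JSON into a list of dicts

# write the extracted records out as CSV
buffer = io.StringIO()  # swap in open("tracks.csv", "w", newline="") for a real file
writer = csv.DictWriter(buffer, fieldnames=["title", "plays"])
writer.writeheader()
writer.writerows(records)

print(buffer.getvalue())
```

The same fetch-parse-store skeleton scales up to any of the APIs mentioned later in this post.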

2. Data Collection and Wrangling (Cleaning)

(Estimated time: 2 months)

A significant part of the data science work is centered around finding apt data that can help you solve your problem. You can collect data from different legitimate sources — scraping (if the website allows), APIs, databases, and publicly available repositories.

Once the data is in hand, an analyst will often find herself cleaning dataframes, working with multi-dimensional arrays, running descriptive/scientific computations, and manipulating dataframes to aggregate data.

Data is rarely clean and formatted for use in the “real world”. Pandas and NumPy are the two libraries that are at your disposal to go from dirty data to ready-to-analyze data.

As you start feeling comfortable writing Python programs, feel free to start taking up lessons on using libraries like pandas and NumPy.

Resources:

Project Ideas:

  • Pick a website or a public API of your choice, collect the data, and transform it so that data from different sources is stored in an aggregated file or table (DB). Example APIs include TMDB, Quandl, the Twitter API, etc.
  • Pick any publicly available dataset; define a set of questions that you’d want to pursue after looking at the dataset and the domain. Wrangle the data to find answers to those questions using pandas and NumPy.
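As a sketch of what such wrangling looks like — using a small hypothetical dataframe rather than a real API dump — pandas can deduplicate, handle missing values, coerce types, and aggregate in a few chained steps:

```python
import pandas as pd
import numpy as np

# Hypothetical raw data with the usual real-world problems:
# mixed types, missing values, and duplicate rows.
raw = pd.DataFrame({
    "city": ["Boston", "Boston", "Austin", None, "Austin"],
    "temp_c": ["21.5", "21.5", "30.1", "18.0", np.nan],
})

clean = (
    raw
    .drop_duplicates()                # remove exact duplicate rows
    .dropna(subset=["city"])          # drop rows with no city
    .assign(temp_c=lambda d: pd.to_numeric(d["temp_c"], errors="coerce"))
)

# Aggregate to answer a question: average temperature per city.
summary = clean.groupby("city")["temp_c"].mean()
print(summary)
```

The same pattern — deduplicate, drop or impute missing values, fix types, then group and aggregate — carries over directly to data pulled from an API.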

3. EDA, Business acumen and Storytelling

(Estimated time: 2–3 months)

The next skill to master is data analysis and storytelling. Drawing insights from the data and then communicating them to management in simple terms and visualizations is the core responsibility of a Data Analyst.

The storytelling part requires you to be proficient with data visualization along with excellent communication skills.

Specific topics:

  • Exploratory data analysis — defining questions, handling missing values, outliers, formatting, filtering, univariate and multivariate analysis.
  • Data visualization — plotting data using libraries like matplotlib, seaborn, and plotly. Knowledge to choose the right chart to communicate the findings from the data.
  • Developing dashboards — a good percent of analysts only use Excel or a specialized tool like Power BI and Tableau to build dashboards that summarise/aggregate data to help the management in making decisions.
  • Business acumen: Work on asking the right questions to answer, ones that actually target the business metrics. Practice writing clear and concise reports, blogs, and presentations.
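To make the EDA steps above concrete, here is a minimal pure-Python sketch — the data and the 1.5 × IQR outlier rule are illustrative assumptions — that handles missing values, flags outliers, and computes a univariate summary:

```python
from statistics import mean, median, quantiles

# Hypothetical daily page-view counts, with missing entries (None)
# and one obvious outlier.
views = [120, 135, None, 128, 5000, 131, None, 124]

# Handle missing values: here we simply drop them.
observed = [v for v in views if v is not None]

# Flag outliers with the common 1.5 * IQR rule.
q1, _, q3 = quantiles(observed, n=4)
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
typical = [v for v in observed if lo <= v <= hi]

# Univariate summary of the cleaned series.
print("n =", len(typical),
      "mean =", mean(typical),
      "median =", median(typical))
```

In a real analysis you would plot the distribution (histogram or box plot) before deciding whether dropping outliers is even appropriate.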

Resources:

Project Ideas

4. Data Engineering

(Estimated time: 4–5 months)

Data engineering underpins R&D teams by making clean data accessible to research engineers and scientists at big data-driven firms. It is a field in itself, and you may decide to skip this part if you want to focus on just the statistical and algorithmic side of the problems.

Responsibilities of a data engineer comprise building an efficient data architecture, streamlining data processing, and maintaining large-scale data systems.

Engineers use shell (CLI), SQL, and Python/Scala to create ETL pipelines, automate file system tasks, and optimize database operations for high performance. Another crucial skill is implementing these data architectures, which demands proficiency in cloud service providers like AWS, Google Cloud Platform, Microsoft Azure, etc.

Resources:

Project Ideas/Certifications to prepare for:

  • AWS Certified Machine Learning (300 USD) — a proctored exam offered by AWS; it adds some weight to your profile (though it doesn’t guarantee anything) and requires a decent understanding of AWS services and ML.
  • Professional Data Engineer — a certification offered by GCP. This is also a proctored exam and assesses your ability to design data processing systems, deploy machine learning models in a production environment, and ensure solution quality and automation.

5. Applied statistics and mathematics

(Estimated time: 4–5 months)

Statistical methods are a central part of data science. Almost all the data science interviews predominantly focus on descriptive and inferential statistics.

People often start coding machine learning algorithms without a clear understanding of the underlying statistical and mathematical methods that explain the working of those algorithms.

Topics you should focus on:

  • Descriptive statistics — being able to summarize data is powerful, but not always enough. Learn about estimates of location (mean, median, mode, weighted statistics, trimmed statistics) and estimates of variability to describe the data.
  • Inferential statistics — designing hypothesis tests and A/B tests, defining business metrics, and analyzing the collected data and experiment results using confidence intervals, p-values, and alpha values.
  • Linear algebra, single- and multivariate calculus — to understand loss functions, gradients, and optimizers in machine learning.

Resources:

  • [Book] Practical Statistics for Data Scientists (highly recommended) — a thorough guide to all the important statistical methods, along with clean and concise applications/examples.
  • [Book] Naked Statistics — a non-technical but detailed guide to understanding the impact of statistics on our routine events, sports, recommendation systems, and many more instances.
  • Statistical Thinking in Python — a foundation course to help you start thinking statistically. There is a second part to this course as well.
  • Intro to Descriptive Statistics — offered by Udacity. Consists of video lectures explaining widely used measures of location and variability (standard deviation, variance, median absolute deviation).
  • Inferential Statistics, Udacity — video lectures that teach you to draw conclusions from data that might not be immediately obvious. It focuses on developing hypotheses and using common tests such as t-tests, ANOVA, and regression.

Project Ideas:

  • Solve the exercises provided in the courses above, and then go through a number of public datasets where you can apply these statistical concepts. Ask questions like “Is there sufficient evidence to conclude that the mean age of mothers giving birth in Boston is over 25 years, at the 0.05 level of significance?”
  • Try to design and run small experiments with your peers/groups/classes by asking them to interact with an app or answer a question. Once you have collected a good amount of data over a period of time, run statistical methods on it. This might be very hard to pull off but should be very interesting.
  • Analyze stock prices or cryptocurrencies and design hypotheses around the average return or any other metric. Determine whether you can reject the null hypothesis or fail to do so using critical values.
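A hypothesis test like the first idea above can be worked end to end with the standard library alone. The sample below is made up, and the one-sided critical value for α = 0.05 with 9 degrees of freedom is taken from a t-table:

```python
from math import sqrt

# Hypothetical sample of mothers' ages; H0: mu = 25 vs H1: mu > 25.
ages = [26, 24, 28, 30, 25, 27, 29, 23, 31, 26]
n = len(ages)
xbar = sum(ages) / n
# Sample standard deviation (n - 1 in the denominator).
s = sqrt(sum((a - xbar) ** 2 for a in ages) / (n - 1))

# One-sample t statistic.
t = (xbar - 25) / (s / sqrt(n))

# One-sided critical value for alpha = 0.05 and df = 9, from a t-table.
t_crit = 1.833
reject_h0 = t > t_crit
print(f"t = {t:.3f}, reject H0: {reject_h0}")
```

In practice you would reach for scipy.stats, but computing the statistic by hand once makes the mechanics of the test much clearer.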

6. Machine Learning / AI

(Estimated time: 4–5 months)

After grinding through all the major concepts mentioned above, you should now be ready to get started with the fancy ML algorithms.

There are three major types of learning:

  1. Supervised learning — includes regression and classification problems. Study simple linear regression, multiple regression, polynomial regression, naive Bayes, logistic regression, KNNs, tree models, and ensemble models. Learn about evaluation metrics.
  2. Unsupervised learning — clustering and dimensionality reduction are the two widely used applications of unsupervised learning. Dive deep into PCA, k-means clustering, hierarchical clustering, and Gaussian mixtures.
  3. Reinforcement learning (can skip*) — helps you build systems that learn by optimizing rewards. Learn about the TF-Agents library, creating deep Q-networks, etc.
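As a toy illustration of the unsupervised side, here is a deliberately minimal 1-D k-means sketch. The naive initialization is an assumption for brevity; real projects would use scikit-learn's KMeans:

```python
from statistics import mean

def kmeans_1d(points, k=2, iters=20):
    # Naive init (a simplifying assumption): start from the k smallest points.
    centroids = sorted(points)[:k]
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [mean(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

centroids, clusters = kmeans_1d([1, 2, 3, 10, 11, 12])
print(centroids)
```

The two alternating steps (assign, then update) are the whole algorithm; everything else in production implementations is smarter initialization and stopping criteria.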

The majority of the ML projects need you to master a number of tasks that I’ve explained in this blog.

Resources:

Deep Learning Specialization by deeplearning.ai

Those of you who are interested in diving further into deep learning can start by completing this specialization offered by deeplearning.ai and the Hands-On book. This is not as important from a data science perspective unless you are planning to solve a computer vision or NLP problem.

Deep learning deserves a dedicated roadmap of its own. I’ll create that with all the fundamental concepts and resources covered.

Track your learning progress


I’ve also created a learning tracker for you on Notion. You can customize it to your needs and use it to track your progress, have easy access to all the resources and your projects.

Find the video version of this blog below!

Data Science with Harshit

This is just a high-level overview of the wide spectrum of data science and you might want to deep dive into each of these topics and create a low-level concept-based plan for each of the categories.

Feel free to respond to this blog or comment on the video if you want me to add a new topic or rename anything. Also, let me know which category you would like me to do project tutorials on.

You can connect with me on Twitter or LinkedIn.

Predicting Song Skipping on Spotify

Predicting Song Skipping on Spotify

Using LightGMB to predict my song skipping habits based solely on audio features

Introduction

In early 2019, Spotify shared interesting statistics about their platform. Out of the 35+ million songs on the service, Spotify users have created over 2 billion playlists (Oskar Stål, 2019). I thought of the analogy that our music taste is like our DNA: very diverse across 7 billion people, yet the building blocks (nucleotides/songs) are the same. As a result, inferring a user’s music taste is challenging — and important, since Spotify’s business model relies on its ability to recommend new songs.

Problem Statement

Spotify doesn’t have a dislike button, so skipped songs are the subtle cues we need to learn from to infer music taste. In this project, I use my 2019 Spotify streaming history to build a predictive model that anticipates whether I would skip a song based solely on its audio features.

You can request your own Spotify streaming history by following these steps.

Data Descriptions

After requesting my Spotify data, I received an email with a ZIP file containing every song I listened to in 2019, the artist’s name, and the streaming duration. The data processing is as follows:

  1. I filtered out podcasts and only analyzed songs.
  2. I used the Spotify API to extract the unique IDs of the songs and their audio features.
  3. I computed the gap between the duration I streamed the track for and the song’s length. If the gap exceeded 60 seconds, I inferred that the song had been skipped.

Below is a detailed Python implementation of the steps.
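The original code embed does not survive in this export, but the skip-labeling step can be sketched with the standard library alone; the record field names here are assumptions, not Spotify's exact export schema:

```python
# Hypothetical streaming-history records; the field names are assumptions,
# not the exact Spotify export schema.
history = [
    {"track": "Song A", "ms_played": 215_000, "track_length_ms": 230_000},
    {"track": "Song B", "ms_played": 40_000,  "track_length_ms": 180_000},
]

SKIP_GAP_MS = 60_000  # the 60-second threshold described above

def label_skip(record):
    # Gap between the song's full length and how long it was streamed.
    gap = record["track_length_ms"] - record["ms_played"]
    return gap > SKIP_GAP_MS

labels = {r["track"]: label_skip(r) for r in history}
print(labels)
```

Here "Song B" was abandoned 40 seconds in, so it gets labeled as skipped, while "Song A" was streamed nearly to the end.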

Since the aim is to see whether audio characteristics alone can inform us about song skipping, I dropped the columns that contain the song’s title and artist.

The final dataset has the following columns:

Image for post

Assumptions

A crucial step in modeling is to lay out all the assumptions and limitations in order to interpret the results properly. Some assumptions are due to the data collection process, and others are part of the modeling process:

  • The user’s music taste is homogeneous, i.e., the mechanism that leads a user to skip a song is static across time.
  • Songs are broken down into audio features hence the lyrics are not interpreted as natural language text. This limitation is important to consider since lyrical meaning can be a strong predictor of song skipping.

Modeling

I use LightGBM binary classification to infer my song-skipping habits based solely on audio features.
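As a rough stand-in sketch of this step — using scikit-learn's gradient boosting (the same model family as LightGBM) and synthetic "audio features", since the actual dataset and tuned parameters are not reproduced here:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for audio features: two features per track.
X = rng.normal(size=(200, 2))
# Synthetic skip labels with a learnable pattern in the first feature.
y = (X[:, 0] > 0).astype(int)

# Gradient boosting as a stand-in for LightGBM's binary classifier.
model = GradientBoostingClassifier(n_estimators=50, random_state=0)
model.fit(X[:150], y[:150])

# Evaluate on a held-out slice.
accuracy = model.score(X[150:], y[150:])
print(f"hold-out accuracy: {accuracy:.2f}")
```

With the real LightGBM library, the code shape is nearly identical (an `LGBMClassifier` with `fit`/`predict`), which is part of why gradient-boosted trees are such a common baseline for tabular features like these.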


Bayesian Optimization

LightGBM has many parameters; thus, instead of running through all of their possible values, I used Bayesian optimization for hyperparameter tuning.


Results & Discussion

The model performs better with personalized data, reaching an accuracy of 74.17% (28th iteration of Bayesian optimization). The assumption that Spotify users are homogeneous is a strong one, and the performance could be improved if we gathered more user-level details.

Overall, recommendation engines require both personalized learning about the user and general learning about the songs. In this project, I experimented with machine learning classification using only audio features, audio and user features, and my personal listening history. A further investigation might include the causal relationships between the covariates because perhaps understanding the mechanism by which the data is generated is more informative than curve-fitting.

References

  • Oskar Stål (2019). Music Recommendations at Spotify. Nordic Data Science and Machine Learning Summit. Retrieved from: https://youtu.be/2VvM98flwq0
  • Brian Brost, Rishabh Mehrotra, and Tristan Jehan. 2019. The Music Streaming Sessions Dataset. In Proceedings of the 2019 World Wide Web Conference (WWW ’19), May 13–17, 2019, San Francisco, CA, USA. ACM, New York, NY, USA, 7 pages. https://doi.org/10.1145/3308558.3313641

Facebook’s PyTorch BigGraph is an Open Source Framework for Capturing Knowledge in Large Graphs

Facebook’s PyTorch BigGraph is an Open Source Framework for Capturing Knowledge in Large Graphs

The new framework can learn graph embeddings in large graph structures.


Source: https://morioh.com/p/fdc360a84d73

I recently started a new newsletter focused on AI education. TheSequence is a no-BS (meaning no hype, no news, etc.) AI-focused newsletter that takes 5 minutes to read. The goal is to keep you up to date with machine learning projects, research papers, and concepts. Please give it a try by subscribing below:

Graphs are one of the fundamental data structures in machine learning applications. Specifically, graph-embedding methods are a form of unsupervised learning, in that they learn representations of nodes using the native graph structure. Training data in mainstream scenarios such as social media predictions, internet of things (IoT) pattern detection, or drug-sequence modeling is naturally represented using graph structures. Any one of those scenarios can easily produce graphs with billions of interconnected nodes. While the richness and intrinsic navigation capabilities of graph structures make them a great playground for machine learning models, their complexity poses massive scalability challenges. Not surprisingly, the support for large-scale graph data structures in modern deep learning frameworks is still quite limited. Recently, Facebook unveiled PyTorch BigGraph, a new framework that makes it much faster and easier to produce graph embeddings for extremely large graphs in PyTorch models.

To some extent, graph structures can be seen as an alternative to a labeled training dataset, as the connections between the nodes can be used to infer specific relationships. This is the approach followed by unsupervised graph embedding methods, which learn a vector representation of each node in a graph by optimizing the objective that the embeddings of pairs of nodes with edges between them are closer together than those of pairs of nodes without a shared edge. This is similar to how word embeddings like word2vec are trained on text.


Source: https://github.com/facebookresearch/PyTorch-BigGraph?fbclid=IwAR1X2QJ5zltf6-f_OZOB2YmBqqQOM99RehXu_kqCmWA_LyPeBfR4MhSXccU

Most graph embedding methods are quite constrained when applied to large graph structures. To give an example, a model with two billion nodes and 100 embedding parameters per node (expressed as floats) would require 800GB of memory just to store its parameters; thus many standard methods exceed the memory capacity of typical commodity servers. This represents a major challenge for deep learning models and is the genesis of Facebook’s BigGraph framework.

PyTorch BigGraph

The goal of PyTorch BigGraph(PBG) is to enable graph embedding models to scale to graphs with billions of nodes and trillions of edges. PBG achieves that by enabling four fundamental building blocks:

  • graph partitioning, so that the model does not have to be fully loaded into memory
  • multi-threaded computation on each machine
  • distributed execution across multiple machines (optional), all simultaneously operating on disjoint parts of the graph
  • batched negative sampling, allowing for processing >1 million edges/sec/machine with 100 negatives per edge

PBG addresses some of the shortcomings of traditional graph embedding methods by randomly dividing the graph’s nodes into P partitions that are sized so that two partitions can fit in memory. The graph edges are then divided into P² buckets based on their source and destination nodes. For example, if an edge has a source in partition p1 and a destination in partition p2, it is placed into bucket (p1, p2). Once the nodes and edges are partitioned, training can be performed on one bucket at a time. The training of bucket (p1, p2) only requires the embeddings for partitions p1 and p2 to be stored in memory. The PBG structure guarantees that buckets have at least one previously trained embedding partition.
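The partitioning scheme itself can be sketched in a few lines of plain Python; the toy graph and partition count below are illustrative, not PBG's actual data structures:

```python
import random

random.seed(0)

# A toy graph: nodes 0..99 with random directed edges.
nodes = list(range(100))
edges = [(random.randrange(100), random.randrange(100)) for _ in range(500)]

P = 4  # number of partitions, chosen so that two fit in memory at once

# Randomly divide the nodes into P partitions.
assignment = {n: random.randrange(P) for n in nodes}

# Bucket each edge by the partitions of its source and destination.
buckets = {}
for src, dst in edges:
    key = (assignment[src], assignment[dst])
    buckets.setdefault(key, []).append((src, dst))

# Training bucket (p1, p2) only needs partitions p1 and p2 in memory.
print(len(buckets), "buckets, out of at most", P * P)
```

The payoff is that each training step touches at most two of the P embedding partitions, which is what lets the full model stay out of memory.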


Source: https://github.com/facebookresearch/PyTorch-BigGraph?fbclid=IwAR1X2QJ5zltf6-f_OZOB2YmBqqQOM99RehXu_kqCmWA_LyPeBfR4MhSXccU

Another area in which PBG really innovates is the parallelization and distribution of the training mechanics. PBG uses PyTorch parallelization primitives to implement a distributed training model that leverages the block partition structure illustrated previously. In this model, individual machines coordinate to train on disjoint buckets using a lock server which parcels out buckets to the workers in order to minimize communication between the different machines. Each machine can train the model in parallel using different buckets.


Source: https://github.com/facebookresearch/PyTorch-BigGraph?fbclid=IwAR1X2QJ5zltf6-f_OZOB2YmBqqQOM99RehXu_kqCmWA_LyPeBfR4MhSXccU

In the previous figure, the Trainer module in machine 2 requests a bucket from the lock server on machine 1, which locks that bucket’s partitions. The trainer then saves any partitions that it is no longer using and loads new partitions that it needs to and from the sharded partition servers, at which point it can release its old partitions on the lock server. Edges are then loaded from a shared filesystem, and training occurs on multiple threads without inter-thread synchronization. In a separate thread, a small number of shared parameters are continuously synchronized with a sharded parameter server. Model checkpoints are occasionally written to the shared filesystem from the trainers. This model allows a set of P buckets to be parallelized using up to P/2 machines.

One of the indirect innovations of PBG is the use of batched negative sampling techniques. Traditional graph embedding models construct random “false” edges as negative training examples along with the true positive edges. This significantly speeds up training because only a small percentage of weights must be updated with each new sample. However, negative samples are produced by “corrupting” true edges with random source or destination nodes, and processing them introduces a performance overhead. PBG introduces a method that reuses a single batch of N random nodes to produce corrupted negative samples for N training edges. In comparison to other embedding methods, this technique allows training on many negative examples per true edge at little computational cost.

To increase memory efficiency and make better use of computational resources on large graphs, PBG leverages a single batch of Bn sampled source or destination nodes to construct multiple negative examples. In a typical setup, PBG takes a batch of B = 1000 positive edges from the training set and breaks it into chunks of 50 edges. The destination (equivalently, source) embeddings from each chunk are concatenated with 50 embeddings sampled uniformly from the tail entity type. The outer product of the 50 positives with the 200 sampled nodes equates to 9900 negative examples.


Source: https://github.com/facebookresearch/PyTorch-BigGraph?fbclid=IwAR1X2QJ5zltf6-f_OZOB2YmBqqQOM99RehXu_kqCmWA_LyPeBfR4MhSXccU

The batched negative sampling approach has a direct impact on the speed of model training. Without batching, the speed of training is inversely proportional to the number of negative samples. Batched training improves that equation, achieving a roughly constant training speed.


Source: https://github.com/facebookresearch/PyTorch-BigGraph?fbclid=IwAR1X2QJ5zltf6-f_OZOB2YmBqqQOM99RehXu_kqCmWA_LyPeBfR4MhSXccU

Facebook evaluated PBG using different graph datasets such as LiveJournal, Twitter data, and YouTube user interaction data. Additionally, PBG was benchmarked using the Freebase knowledge graph, which contains more than 120 million nodes and 2.7 billion edges, as well as a smaller subset of the Freebase graph, known as FB15k, which contains 15,000 nodes and 600,000 edges and is commonly used as a benchmark for multi-relation embedding methods. The FB15k experiments showed PBG performing similarly to state-of-the-art graph embedding models. However, when evaluated against the full Freebase dataset, PBG showed memory consumption improving by over 88%.


Source: https://github.com/facebookresearch/PyTorch-BigGraph?fbclid=IwAR1X2QJ5zltf6-f_OZOB2YmBqqQOM99RehXu_kqCmWA_LyPeBfR4MhSXccU

PBG is one of the first methods that can scale the training and processing of graph data to structures with billions of nodes and trillions of edges. The first implementation of PBG has been open sourced on GitHub, and we should expect interesting contributions in the near future.

Kubernetes is deprecating Docker in the upcoming release

Kubernetes is deprecating Docker in the upcoming release

Kubernetes and Docker will part ways; what does that mean to you?

Photo by CHUTTERSNAP on Unsplash

This moment was long in coming; Kubernetes is deprecating Docker as a container runtime after version 1.20, in favor of runtimes that use the Container Runtime Interface (CRI) created for Kubernetes. However, this does not mean Docker’s death, and it does not mean that you should abandon your favorite containerization tool.

As a matter of fact, not a whole lot will be changing for you, as an end-user of Kubernetes; you will still be able to build your containers using docker and the images produced by running docker build will still run in your Kubernetes cluster.

Then why all this fuss? What is changing, and why does Docker suddenly seem like the black sheep? Should we continue writing Dockerfiles?

Learning Rate is my weekly newsletter for those who are curious about the world of AI and MLOps. You’ll hear from me every Friday with updates and thoughts on the latest AI news, research, repos and books. Subscribe here!

Don’t panic

The source of confusion here is that we are talking about two different things. Inside the Kubernetes cluster nodes, a container runtime daemon manages the complete container lifecycle: image pulling and storage, container execution and supervision, network attachments, and many more.

Docker is arguably the most popular choice; however, Docker was not designed to be embedded inside Kubernetes. That is the root of all problems. Docker is not just a container runtime; it is an entire tech stack with many UX enhancements that make it easy for us to interact with it. Indeed, Docker itself contains a high-level container runtime: containerd. And containerd will be a container runtime option for you moving forward.

Moreover, these UX enhancements are not necessary for Kubernetes. If anything, they are obstacles that Kubernetes must work around to get what it really needs. This means that the Kubernetes cluster has to use another tool, called Dockershim, to talk to Docker through the CRI. That adds a level of complexity and another tool that the team should maintain — another source of bugs and problems.

So, what is really happening here is that Kubernetes will remove Dockershim in version 1.23, which will break Docker support.

Should you care?

So, what is changing for you as a developer? Not that much. If you are using Docker in your development process, you will continue to do that, and you will not notice any differences. When you build an image using Docker, the result is not a Docker-specific thing. It’s an OCI (Open Container Initiative) image. Kubernetes and its compliant container runtimes (e.g., containerd or CRI-O) know how to pull and work with those images. This is why we have a standard for what containers should look like in the first place.

On the other hand, if you are using a managed Kubernetes service, like GKE or EKS, you will need to make sure that your nodes are running a supported container runtime before Docker support is removed, and re-apply or update any custom configurations you use. If you are running Kubernetes on-premises, you will also need to make changes to avoid unwanted problems and surprises.

Conclusion

At version 1.20, you will get a deprecation warning for Docker. This change is coming, and like any other, it will likely cause some issues at first. But it isn’t catastrophic, and in the long run, it’s going to make things easier.

I hope this article made some things clearer and relieved some anxieties. At the end of the day, these changes will probably mean nothing to you as a developer.


About the Author

My name is Dimitris Poulopoulos, and I’m a machine learning engineer working for Arrikto. I have designed and implemented AI and software solutions for major clients such as the European Commission, Eurostat, IMF, the European Central Bank, OECD, and IKEA.

If you are interested in reading more posts about Machine Learning, Deep Learning, Data Science, and DataOps, follow me on Medium, LinkedIn, or @james2pl on Twitter.

Opinions expressed are solely my own and do not express the views or opinions of my employer.

The Reason Behind if __name__ == ‘__main__’ in Python

The Reason Behind if __name__ == ‘__main__’ in Python

Why is it necessary?

the statement if __name__ == '__main__':

Photo by author generated from carbon

You might have seen this one before: the syntax which often gets ignored because it doesn’t seem to hinder the execution of your code. It may not seem necessary, but that’s only if you’re working with a single Python file.

Let’s Get Right Into It!

Let’s start out by deconstructing the statement from left to right. We already know what an if statement is; however, the most important part of the statement is the two things being compared.

Let’s start with __name__. This variable holds the name of the file that is currently being used, but there is a trick to this: the file currently being run will always have the value __main__.

This sounds confusing at first but let’s clarify.

Let’s create two Python files:

  • current_script.py
  • other_script.py

Please ensure these files are in the same directory/folder.

Inside the other_script.py file, we’ll add two print statements, just as shown below.

print("****inside other script.py*****")
print("__name__ is ", __name__)

Run this other_script.py file.

Note: I will be running this file while using Python within the terminal, just as illustrated below. Also note that I am working from a Windows operating system.

python other_script.py

Output:

****inside other script.py*****
__name__ is __main__

Now you realize that it’s just as I stated before. The file being executed will always have the value __main__. This represents the point of entry into our application.

In Python and pretty much every programming language, we can import other files into our application. We’ll now go into our current_script.py and input the following code:

import other_script

print("")
print("****inside current script.py*****")
print("__name__ is ", __name__)

The code above imports other_script.py with the import statement at the top, followed by print("****inside current script.py*****") to verify that we are in the current_script.py file.

Be aware that because we imported other_script at the top of the file, the entire contents of other_script.py will be executed at the point where import other_script is.

Before we continue, take keen note of the output from when we ran other_script.py. Now observe what happens when we execute current_script.py.

python current_script.py

Output:

****inside other script.py*****
__name__ is other_script
****inside current script.py*****
__name__ is __main__

You will now realize that previously when we ran other_script.py, it gave us the value for __name__ as __main__. But now since we ran it as an import in current_script.py, the value of __name__ suddenly changed to the name of the imported script which is other_script.

Furthermore, the value of __name__ for current_script.py is __main__. This goes back to what I had highlighted previously: The file currently being run will always have the value __main__.

Let’s put this all together now.

The file you are currently running will always be __main__, while any other imported files will not be. Those will have the name of their respective files.

Use Cases

This syntax comes in handy when you have programs that have multiple Python files.

Let’s create a program that has two classes: a Name class and a Person class. These two classes will be placed in two separate files, name.py and person.py. The Person class uses the Name class in this system.

We’ll start out by building the Name class in the name.py file. This is a simple class that has only two attributes, fname (first name) and lname (last name) along with their corresponding getters and setters.

__repr__ provides the default output when the object is printed.

We added the line if __name__ == "__main__":. Based on our understanding, we can tell that the body of this if statement will only be executed if this file is the one being run — meaning it is not an import.

But why do we do this?

We do this because it is one of the most important steps when we want certain operations to be done only on the file we are currently running. In this scenario, we wrote a Name class and we are testing out its functionalities.
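The embedded gist does not render in this export; a minimal sketch of what name.py might look like (the getters and setters are trimmed to the essentials):

```python
# name.py -- a minimal sketch of the Name class described above
class Name:
    def __init__(self, fname, lname):
        self.fname = fname
        self.lname = lname

    # Simple getters and setters for both attributes.
    def get_fname(self):
        return self.fname

    def set_fname(self, fname):
        self.fname = fname

    def get_lname(self):
        return self.lname

    def set_lname(self, lname):
        self.lname = lname

    def __repr__(self):
        # Default output when the object is printed.
        return f"fname={self.fname};lname={self.lname}"

if __name__ == "__main__":
    # Runs only when name.py itself is executed, not when it is imported.
    print(Name("Jordan", "Williams"))
```

Running python name.py directly executes the guarded test code, while importing Name from another file does not.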

Output:

fname=Jordan;lname=Williams

As you can see from the output above, we were able to test the functionality of the Name class. However, this concept will not hit home until we’ve built the other class.

Let’s create our person.py file.

Notice from name import Name, where I imported the Name class into our file; it is used on line 7, where self.name = Name(fname, lname).

Output:

201107
John Brown
Male

This is the output from testing our Person class. Notice that there is no output from the Name class because it is encased under the condition __name__ == “__main__” and it is currently not the main file.

Let’s now remove __name__ == “__main__” from name.py and see the difference:

Notice that __name__ == "__main__" is now removed. We will now run our person.py file.

Output:

fname=Jordan;lname=Williams
201107
John Brown
Male

See, here we only wanted to test the functionality of the Person class; however, we are getting output from the Name class as well.

This could also be a problem if you had made some Python library and wanted to extend that functionality to another class but do not want that other library to run automatically in your current script.

Other Languages

Some of you who dabble in other programming languages might have noticed that this is the same as the main method or function found in other languages. Programming languages such as Java with public static void main(String[] args), C# with a similar public static void Main(string[] args), and C with int main(void) all have some sort of main function or method present to execute multiple files/scripts of code.

Let’s look at the equivalent code in another language.

Let’s look at Java for instance.

Summary

Sometimes you want to run some logic only in the file currently being executed. This comes in handy when testing individual units of code without the hindrance of other files being executed, and when building libraries that depend on other libraries — you wouldn’t want a rogue execution of another library in the code you are in.