Automatic Open Source-based Data Pipelines? Openshift To the Rescue!

2021년 2월 23일 by hyungseok

Automatic Open Source-based Data Pipelines? Openshift To the Rescue!

Shon Paz

2 days ago·11 min read

Image for post — Photo by Stephen Dawson on Unsplash

Kubernetes is our new Operating System, no one can doubt that anymore. As a lot of effort has been made, in order to develop a micro-services approach, and migrate workloads towards Kubernetes, organizations left their data services behind.

We all saw, due to COVID-19, how important data is, and how important it is to have the proper architecture and data fabric. Data won’t stop from growing! more even, it’ll just keep breaking its consumption records one year after the other.

This challenge forces us to provide a more automatic, and scalable solution to our organization by moving our data services to Kubernetes, where all the magic happens. Kubernetes offers Operators that will help you manage both day-1 and day-2 operations, using health checks, state preserving, auto-pilots, etc.

In this demo, I’d like to show you how you can run your automatic data pipelines, using Operators that are offered in every Openshift installation via the Operator Hub. I chose to take Real-Time BI as a use case, and build this demo around it. This demo leverages Openshift’s mechanisms, in order to create scalable, Kubernetes-based data pipelines and uses all the de-facto standard products in order to fulfill those requirements. All Operators will be deployed on an Openshift 4 cluster, while Openshift Container Storage will provide the underlying storage solution for both Object and Block storage protocols.

This demo deploys a music streaming application, that generates events based on users’ behavior (will be explained further).

Using that data that is being generated, we can use Open Source tools in order to create our dashboards and visualizations, and provide our stakeholders so as our data scientists a more reliable way to visualize important data.

THIS AFFECTS BUSINESS LOGIC DIRECTLY!

Now that the message is clear, let’s start playing!

Prerequisites

A running Ceph Cluster (> RHCS4)
A running Openshift 4 cluster (> 4.6.8)
An OCS cluster, in external mode, to provide both Object and Block storage

Installation

Create a new project in your Openshift cluster, where all resources should be deployed:

$ oc new-project data-engineering-demo

Install both AMQ Streams and Presto Operators, as we’ll need those to create our relevant resources. go the the Operator Hub section on the left panel to install:

Clone the needed git repository so you’ll be able to deploy the demo:

$ git clone https://github.com/shonpaz123/cephdemos.git

Change your directory into the demo directory, where all manifests are located:

$ cd cephdemos/data-engineering-pipeline-demo-ocp

Data Services Preparation

Preparing our S3 environment

Now that we have all the prerequisites ready, let’s start by creating our needed S3 resources. As we are using an external Ceph cluster, we should create the needed S3 user in order to interact with the cluster. Additionally, we need to create an S3 bucket so that Kafka could export our events to the data lake. Let’s create those resources:

$ cd 01-ocs-external-ceph && ./run.sh && cd ..

The expected output:

{
    "user_id": "data-engineering-demo",
    "display_name": "data-engineering-demo",
    "email": "",
    "suspended": 0,
    "max_buckets": 1000,
    "subusers": [],
    "keys": [
        {
            "user": "data-engineering-demo",
            "access_key": "HC8V2PT7HX8ZFS8NQ37R",
            "secret_key": "Y6CENKXozDDikJHQgkbLFM38muKBnmWBsAA1DXyU"
        }
    .
    .
    .
}
make_bucket: music-chart-songs-store-changelog

The script uses awscli in order to export our credentials as environment variables, so that we’ll be able to create the bucket properly. Make sure that you have access to your endpoint URL with all the open ports so that this script will work properly.

Deploying Kafka new-ETL

Now that we have our S3 ready, we need to deploy all the needed Kafka resources. In this section we’ll deploy a Kafka cluster, using the AMQ Streams operator, that is offered via the Openshift Operator Hub. Additionally, we’ll deploy Kafka Topics and Kafka Connect as well, in order to export all of the existing topic events to our S3 bucket. Important! make sure that you change the endpoint URL to suit yours, or else Kafka Connect will try to expose the events with no success.

Run the script in order to create those resources:

$ cd 02-kafka && ./run.sh && cd ..

Now let’s verify all pods were successfully created:

$ oc get pods 
NAME                                                  READY   STATUS    RESTARTS   AGE
amq-streams-cluster-operator-v1.6.2-5b688f757-vhqcq   1/1     Running   0          7h35m
my-cluster-entity-operator-5dfbdc56bd-75bxj           3/3     Running   0          92s
my-cluster-kafka-0                                    1/1     Running   0          2m10s
my-cluster-kafka-1                                    1/1     Running   0          2m10s
my-cluster-kafka-2                                    1/1     Running   0          2m9s
my-cluster-zookeeper-0                                1/1     Running   0          2m42s
my-connect-cluster-connect-7bdc77f479-vwdbs           1/1     Running   0          71s
presto-operator-dbbc6b78f-m6p6l                       1/1     Running   0          7h30m

We see that all pods are in a running state and passed their probes, so let’s verify we have the needed topics:

$ oc get kt
NAME                                                          CLUSTER      PARTITIONS   REPLICATION FACTOR
connect-cluster-configs                                       my-cluster   1            3
connect-cluster-offsets                                       my-cluster   25           3
connect-cluster-status                                        my-cluster   5            3
consumer-offsets---84e7a678d08f4bd226872e5cdd4eb527fadc1c6a   my-cluster   50           3
music-chart-songs-store-changelog                             my-cluster   1            1
played-songs                                                  my-cluster   12           3
songs                                                         my-cluster   12           3

Those topics will be used by our streaming application to receive, transform and export those events with the proper format, into our S3 bucket. In the end, topic music-chart-songs-store-changelog will hold all the information with its final structure, so that we’ll be able to query it.

Running Presto for Distributed Querying

In this demo, we’ll use Presto’s ability to query S3 bucket prefixes (similar to tables in relational databases). Presto needs a schema to be created, in order to understand what is the file structure that it needs to query, in our example, all events that are being exported to our S3 bucket will look like the following:

{"count":7,"songName":"The Good The Bad And The Ugly"}

Each file will be exported with a JSON structure, that holds two key-value pairs. To emphasize, you can think of it as a table, with two columns where the first one is count and the second is songName, and all files that are being written to the bucket are just rows with this structure.

Now that we have a better understanding of our data structure, we can deploy our Presto cluster. This cluster will create a hive instance to store the schema metadata (with Postgres to store the schema information), and a Presto cluster that contains the coordinator and worker pods. All of those resources will be automatically created by the Presto Operator, which is offered as part of the Openshift Operator Hub as well.

Let’s run the script to create those resources:

$ cd 04-presto && ./run.sh && cd ..

Now let’s verify all pods were successfully created:

$ oc get pods | egrep -e "presto|postgres"
NAME                                                  READY   STATUS    RESTARTS   AGE
hive-metastore-presto-cluster-576b7bb848-7btlw        1/1     Running   0          15s
postgres-68d5445b7c-g9qkj                             1/1     Running   0          77s
presto-coordinator-presto-cluster-8f6cfd6dd-g9p4l     1/2     Running   0          15s
presto-operator-dbbc6b78f-m6p6l                       1/1     Running   0          7h33m
presto-worker-presto-cluster-5b87f7c988-cg9m6         1/1     Running   0          15s

Visualizing real-time data with Superset

Superset is a visualization tool, that can present visualization and dashboards from many JDBC resources, such as Presto, Postgres, etc. As Presto has no real UI that provides us the ability to explore our data, controlling permissions, and RBAC, we’ll use Superset.

Run the script in order to deploy Superset in your cluster:

$ cd 05-superset && ./run.sh && cd ..

Now verify all pods were successfully created:

$ oc get pods | grep superset
superset-1-deploy                                     0/1     Completed   0          72s
superset-1-g65xr                                      1/1     Running     0          67s
superset-db-init-6q75s                                0/1     Completed   0          71s

Nice! all went well!

Data Logic Preparation

After we have all of our infrastructure services ready, we need to create the data logic behind our streaming application. As Presto queries data from our S3 bucket, we need to create a schema, that will allow Presto to know how it should query our data, so as a table to provide the structure knowledge.

$ oc rsh $(oc get pods | grep coordinator | grep Running | awk '{print $1}')

Change your context to work with the hive catalog:

$ presto-cli --catalog hive

Create a schema, that’ll tell Presto to use the s3a connector in order to query data from our S3 bucket prefix:

$ CREATE SCHEMA hive.songs WITH (location='s3a://music-chart-songs-store-changelog/music-chart-songs-store-changelog.json/');

Change the schema context, and create a table:

$ USE hive.songs;
$ CREATE TABLE songs (count int, songName varchar) WITH (format = 'json', external_location = 's3a://music-chart-songs-store-changelog/music-chart-songs-store-changelog.json/');

Pay attention! creating the table provides Presto the actual knowledge of each file’s structure, as we saw in the previous section. Now let’s try to query our S3 bucket:

$ select * from songs;
 count | songname 
-------+----------
(0 rows)Query 20210203_162730_00005_7hsqi, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
1.01 [0 rows, 0B] [0 rows/s, 0B/s]

We have no data, and it’s OK! we haven’t started streaming any data, but we see that we get no error, which means Presto can access our S3 service.

Streaming Real-Time Events

Now that all resources are ready to use, we can finally deploy our streaming application! Our streaming application is actually a Kafka producer that simulates a media player, it has a pre-defined list of songs that are being randomly “played” by our media player. Each time a user plays a song, the event is being sent to a Kafka topic.

Then, we’re using Kafka Streams, in order to transform the data to our wanted structure. Streams will take each event that is being sent to Kafka, transform it, and write it to another topic, where it’ll be automatically exported to our S3 bucket.

Let’s run the deployment:

$ cd 03-music-chart-app && ./run.sh && cd ..

Let’s verify all pods are running, the player-app pod is our media player, while the music-chart pod is actually a pod that holds all the Kafka Streams logic:

$ oc get pods | egrep -e "player|music"
music-chart-576857c7f8-7l65x                          1/1     Running     0          18s
player-app-79fb9cd54f-bhtl5                           1/1     Running     0          19s

Let’s take a look at the player-app logs:

$ oc logs player-app-79fb9cd54f-bhtl52021-02-03 16:28:41,970 INFO  [org.acm.PlaySongsGenerator] (RxComputationThreadPool-1) song 1: The Good The Bad And The Ugly played.
2021-02-03 16:28:46,970 INFO  [org.acm.PlaySongsGenerator] (RxComputationThreadPool-1) song 1: The Good The Bad And The Ugly played.
2021-02-03 16:28:51,970 INFO  [org.acm.PlaySongsGenerator] (RxComputationThreadPool-1) song 2: Believe played.
2021-02-03 16:28:56,970 INFO  [org.acm.PlaySongsGenerator] (RxComputationThreadPool-1) song 3: Still Loving You played.
2021-02-03 16:29:01,972 INFO  [org.acm.PlaySongsGenerator] (RxComputationThreadPool-1) song 2: Believe played.
2021-02-03 16:29:06,970 INFO  [org.acm.PlaySongsGenerator] (RxComputationThreadPool-1) song 7: Fox On The Run played.

We see that we have the data being written randomly, each time a song is being played, an event is being sent to our Kafka topic. Now, let’s take a look at our music-chart logs:

$ oc logs music-chart-576857c7f8-7l65x [KTABLE-TOSTREAM-0000000006]: 2, PlayedSong [count=1, songName=Believe]
[KTABLE-TOSTREAM-0000000006]: 8, PlayedSong [count=1, songName=Perfect]
[KTABLE-TOSTREAM-0000000006]: 3, PlayedSong [count=1, songName=Still Loving You]
[KTABLE-TOSTREAM-0000000006]: 1, PlayedSong [count=1, songName=The Good The Bad And The Ugly]
[KTABLE-TOSTREAM-0000000006]: 6, PlayedSong [count=1, songName=Into The Unknown]
[KTABLE-TOSTREAM-0000000006]: 3, PlayedSong [count=2, songName=Still Loving You]
[KTABLE-TOSTREAM-0000000006]: 5, PlayedSong [count=1, songName=Sometimes]
[KTABLE-TOSTREAM-0000000006]: 2, PlayedSong [count=2, songName=Believe]
[KTABLE-TOSTREAM-0000000006]: 1, PlayedSong [count=2, songName=The Good The Bad And The Ugly]

We see that data is being transformed successfully, and that the count number increases as users play more songs.

Now, we need to make sure our pipeline works, so let’s go the our S3 service to verify all events are being exported successfully. for this purpose, I’ve used Sree as the S3 browser. Make sure you’re using the right credentials and endpoint URL:

Let’s go back to our Presto coordinator pod and try to query our data again:

$ presto> presto-cli --catalog hive
$ presto:songs> USE hive.songs;

Run the SQL query in order to fetch our data:

$ select * from songs;
 count |           songname            
-------+-------------------------------
     1 | Bohemian Rhapsody             
     4 | Still Loving You              
     1 | The Good The Bad And The Ugly 
     3 | Believe                       
     1 | Perfect                       
     1 | Sometimes                     
     2 | The Good The Bad And The Ugly 
     2 | Bohemian Rhapsody             
     3 | Still Loving You              
     4 | Sometimes                     
     2 | Into The Unknown              
     4 | Believe                       
     4 | Into The Unknown              
     2 | Sometimes                     
     5 | Still Loving You              
     3 | The Good The Bad And The Ugly

Amazing! We see that our data is being updated automatically! try running this command a few more times, and you’ll see that the number of rows grows. Now, in order to start visualizing our data, look for the Superset route, where you’ll be able to login to the console:

$ oc get routeNAME       HOST/PORT                                            PATH   SERVICES   PORT       TERMINATION   WILDCARD
superset   superset-data-engineering-demo.apps.ocp.spaz.local          superset   8088-tcp                 None

When we reach our Superset console (login with admin:admin), we can see that we can go to Manage Databases –> Create Database to create our Preto connection, make sure you put your Presto’s ClusterIP service name, at the end make sure you test your connection:

Now that we can have a more convenient way to query our data, let’s try exploring our data a bit. Go to SQL Lab, and see that you can perform our previous query. To emphasize, watch the following query, that will show how many times each song has been played:

Good! we can query data! feel free to create all your wanted visualizations and dashboards. As an example, I’ve created a dashboard that changes in real-time, as every refresh to the dashboard actually queries all the data from Presto once again:

Conclusion

In this demo, we saw how we can leverage Open Source products in order to run automatic data pipelines, all scheduled on Openshift. As Kubernetes breaks the records of adoption, organizations should consider moving their workloads towards Kubernetes, so that their data services won’t be left behind. Using Red Hat and Partner Operators, Openshift offers both day-1 and day-2 management to your data services.

Thank you for reading this blog post, see ya next time 🙂

Time Series Analysis with Facebook Prophet: How it works and How to use it

2021년 2월 22일 by hyungseok

Time Series Analysis with Facebook Prophet: How it works and How to use it

An explanation of the math behind facebook profit and how to tune the model using COVID-19 data as an example.

Mitchell Krieger

Time series data can be difficult and frustrating to work with, and the various algorithms that generate models can be quite finicky and difficult to tune. This is particularly true if you are working with data that has multiple seasonalities. In addition, traditional time series models like SARIMAX have many stringent data requirements like stationarity and equally spaced values. Other time series models like Recurring Neural Networks with Long-Short Term Memory (RNN-LSTM) can be highly complex and difficult to work with if you don’t have a significant level of understanding about neural network architecture. So for the average data analyst, there is a high barrier of entry to time series analysis. So in 2017, a few researchers at Facebook published a paper called, “Forecasting at Scale” which introduced the open-source project Facebook Prophet, giving quick, powerful, and accessible time-series modeling to data analysts and data scientists everywhere.

To further explore Facebook Prophet, I’m going to first summarize the math behind it and then go over how to use it in Python (although it can also be implemented in R).

What is Facebook Prophet and how does it work?

Facebook Prophet is an open-source algorithm for generating time-series models that uses a few old ideas with some new twists. It is particularly good at modeling time series that have multiple seasonalities and doesn’t face some of the above drawbacks of other algorithms. At its core is the sum of three functions of time plus an error term: growthg(t), seasonality s(t), holidays h(t) , and error e_t :

The Growth Function (and change points):

The growth function models the overall trend of the data. The old ideas are should be familiar to anyone with a basic knowledge of linear and logistic functions. The new idea incorporated into Facebook prophet is that the growth trend can be present at all points in the data or can be altered at what Prophet calls “changepoints”.

Changepoints are moments in the data where the data shifts direction. Using new COVID-19 cases as an example, it could be due to new cases beginning to fall after hitting a peak once a vaccine is introduced. Or it could be a sudden pick up of cases when a new strain is introduced into the population and so on. Prophet can automatically detect change points or you can set them yourself. You can also adjust the power the change points have in altering the growth function and the amount of data taken into account in automatic changepoint detection.

The growth function has are three main options:

Linear Growth: This is the default setting for Prophet. It uses a set of piecewise linear equations with differing slopes between change points. When linear growth is used, the growth term will look similar to the classic y = mx + b from middle school, except the slope(m) and offset(b) are variable and will change value at each changepoint.
Logistic Growth: This setting is useful when your time series has a cap or a floor in which the values you are modeling becomes saturated and can’t surpass a maximum or minimum value (think carrying capacity). When logistic growth is used, the growth term will look similar to a typical equation for a logistic curve (see below), except it the carrying capacity (C) will vary as a function of time and the growth rate (k) and the offset(m) are variable and will change value at each change point.

Flat: Lastly, you can choose a flat trend when there is no growth over time (but there still may be seasonality). If set to flat the growth function will be a constant value.

The Seasonality Function:

The seasonality function is simply a Fourier Series as a function of time. If you are unfamiliar with Fourier Series, an easy way to think about it is the sum of many successive sines and cosines. Each sine and cosine term is multiplied by some coefficient. This sum can approximate nearly any curve or in the case of Facebook Prophet, the seasonality (cyclical pattern) in our data. All together it looks like this:

If the above is difficult to decipher, I recommend this simple breakdown of the Fourier Series or this video on the intuition behind the Fourier series.

If you are still struggling to understand the Fourier series, do not worry. You can still use Facebook Prophet because Prophet will automatically detect an optimal number of terms in the series, also known as the Fourier order. Or if you are confident in your understanding and want more nuance, you can also choose the Fourier order based on the needs of your particular data set. The higher the order the more terms in the series. You can also choose between additive and multiplicative seasonality.

The Holiday/Event Function:

The holiday function allows Facebook Prophet to adjust forecasting when a holiday or major event may change the forecast. It takes a list of dates (there are built-in dates of US holidays or you can define your own dates) and when each date is present in the forecast adds or subtracts value from the forecast from the growth and seasonality terms based on historical data on the identified holiday dates. You can also identify a range of days around dates (think the time between Christmas/New Years, holiday weekends, thanksgiving’s association with Black Friday/Cyber Monday, etc).

How to use and tune Facebook Prophet

It can be implemented in R or Python, but we’ll focus on use in Python in this blog. You’ll need at least Python 3.7. To install:

$pip install pystan
$pip install fbprophet

Prepare the data

After reading in data and cleaning using pandas, you are almost ready to use Facebook Prophet. However, Facebook Prophet requires that the dates of your time series are located in a column titled ds and the values of the series in a column titled y. Note that if you are using logistic growth you’ll also need to add additional cap and floor columns with the maximum and minimum values of the possible growth at each specific time entry in the time series.

For demonstration, we’ll use new COVID-19 cases tracked by the New York Times on Github. First, we read and prepare the data in the form above. It doesn’t seem like there is logistic growth here so we’ll just focus on creating the ds and y columns:

Run a basic Facebook Prophet model

Facebook Prophet operates similarly to scikit-learn, so first we instantiate the model, then call .fit(ts) passing the time series through it. When calling .predict(ts), Prophet outputs a lot of information. Luckily, the developers added a method called .make_future_dataframe(periods = 10) that will easily collect all of the output in an organized way. This method outputs an empty pandas dataframe that we will fill with the forecast using the .predict(ts)method. The forecast will contain a prediction for every historical value present in the dataset plus additional forecasts for the number of periods passed through the method (in the case above 10). There are many columns of useful information in this future dataframe but the most important ones are:

ds contains the timestamp entry of the forecast
yhat contains the forecasted value of the time series
yhat_lower contains the bottom of the confidence interval for the forecast
yhat_upper contains the bottom of the confidence interval for the forecast

A .plot() function is also provided for easy plotting of the original data, the forecast and the confidence interval of the model. In this first iteration of the model we will allow Prophet to automatically choose the hyperparameters:

This outputs the following plotted forecast:

You can also add changepoints to the above plot by adding the following code:

Seems pretty decent, considering we didn’t tune any hyperparameters! Prophet picked up on a weekly seasonality of newly reported cases (probably due to differing weekend hours of testing sites) and an overall upward trend. It also added change points when during the summer and fall to better model the large increase in the rate of new cases. However, it doesn’t visually seem like a great model overall and misses many key trends in the original data. So we’ll need to tune it to get a better assessment of what is going on.

Tuning Facebook Prophet

Let’s fix some of the key problems our above model has:

Misses the downturn: Prophet was unable to incorporate the downturn in new COVID cases after the new year. This is because the default setting for the range of data points considered when identifying changepoints is the first 80% of data in the time series. We can fix this by setting changepoint_range = 1 when instantiating the model which will incorporate 100% of the data. In other situations, it may be good to keep the changepoint range at 80% or lower to ensure that the model doesn’t overfit your data and can understand the last 20% on its own. But, in this case, because we are just trying to accurately model what has happened so far, we’ll allow the adjustment to 100%.
Strength of changepoints: While its great prophet was able to create change points, it visually seems like some of the changepoints are quite weak in impact on the model, or possibly there aren’t enough changepoints. The changepoint_prior_scale and the n_changepoints hyperparameters allow us to adjust this. By default, changepoint_prior_scale it is set to 0.05, increasing this value allows the automatic detection of more change points and decreases it allows for less. Alternatively, we can specify a number of changepoints to detect using n_changepoints or list the changepoints ourselves using changepoints. Be careful with this, as too many changepoints may cause overfitting.
Possible overfitting due to seasonality: While it’s cool that it picked up on the weekly seasonality of new cases, in this particular context it’s more important to understand the overall trend of cases to possibly predict when the pandemic will end. Prophet has built-in hyperparameters to allow you to adjust daily, weekly and yearly seasonality. So we can fix this by setting weekly_seasonality = False. Alternatively, we could try to create our own custom seasonality and adjust the Fourier order using the.add_seasonality()method or we could dampen the automatic seasonality using the seasonality_prior_scale hyperparameter. However, in this case, it might be a little overkill to use either of those options

Running the model again with these changes yields:

Wow! With three small changes to the hyperparameters, we have a pretty accurate model of the behavior of new COVID cases over the past year. In this model, it predicts that cases will be near zero in early March. This is probably unlikely, as cases will probably decrease asymptotically.

Facebook Prophet is easy to use, fast, and doesn’t face many of the challenges that some other kinds of time-series modeling algorithms face (my favorite is that you can have missing values!). The API also includes documentation on how to use walk forward and cross validation, incorporate exogenous variables, and more. You can also check out this GitHub repository for the Jupyter Notebooks containing the code used in this blog.

Fundamentals aboutScalability of Software Systems

2021년 2월 21일 by hyungseok

Fundamentals about Scalability of Software Systems

A guide for building Intuition for scalable system design

Gaurav Goel

Photo by Sam Moqadam on Unsplash

Imagine that you are the owner of a grocery store. You have one billing counter. Customers come into your store, pick stuff and get in line at this billing counter to pay. You have an employee named John, behind the billing counter taking care of customers. John is a happy-go-lucky person who welcomes each customer with a smile and strikes small conversations with them while billing. John takes his own sweet time when the number of customers in the store is less. There is not much rush in the queue and nobody complains. However, on some days when the rush is more, John cuts down his conversations, moves fast, and tries to manage the customers.

Very soon, your grocery store becomes famous in the town and you see an influx of customers pouring into the store. That’s very good for the business. However, now you are facing a problem. You just have 1 employee John and a single billing counter. John is trying his best to handle the customer load but there is a physical limit to the effort he can put in. After all, he is a human being. There is another problem also. There are days when John is not well or he has to take a leave. In such instances, you have to close the store as you are mostly out-of-town for other work. The store is not “available” on such days.

You have to “SCALE” to tackle these problems

You decide to hire another employee named Sam and open a second billing counter. This eases out the “load” on John. There are 2 queues. Some customers go to Sam while others go to John. You are happy, John is happy and your customers are also happy. This also solves the 2nd problem (to some extent). On days when one of your employees is not available, the other one can try to handle the customers. Of course, the load on him will be much more but at least you don’t need to close your store that day. It will still be “available”.

Eventually, the number of customers keeps on increasing and you decide to open a 3rd billing counter and hire a 3rd employee. You know, this strategy works. Your store is always full and customers now form 3 queues. Sometimes, however, there are situations when the line is big on one billing counter than others. Customers randomly decide to stand for any counter and it may end up that one employee is extremely busy while the other two are relatively free. To overcome such a situation, you hire a fourth employee. His job is to stand at the center of all 3 counters and direct customers where to go, by viewing which counter is free and so. He kind of “balances” the load of work so that it is more or less evenly distributed among your employees.

=====================================

What has this to do with Software applications? Pretty much every bit of it.

At a very high level, a software web application will consist of a web server that hosts the application and a database server that maintains the data. In the above story –

Replace “grocery store” with the “software application”

Replace John with the web server/database server which maintain this application.

Replace customers with “users” of your application.

The architecture of this setup looks like below.

The users (or customers) access your application through a URL (www.myapplication.com). The HTTP requests are sent to a web server. The webserver returns HTML pages. Think of a web server/DB server as if John is handling the requests from customers. As the number of customers increases, the load on your setup (i.e John) increases. The customers will start experiencing a slower response or some of them may not get a response since your webserver/DB server is busy. What do you do? You have to “scale”

Vertical Scaling

Vertical scaling, also known, as “scaling up” means that you add more power to your servers. e.g Add more CPU or RAM so that it can handle more load. This is equivalent to saying that John starts to use more of his energy, cut down his chit-chats, and act fast. I know this is a very bad analogy but you get the idea.

Vertical scaling has a limitation. You can’t just add unlimited memory or all unlimited CPU to just 1 server. It’s like saying that John can act only as fast as his physical limit. Not more than that. Also, in this case, if the server goes down (John goes on a leave), your application goes down. It will be “unavailable”.

Horizontal Scaling

Horizontal Scaling or “scaling out” means that you add more servers to your setup. It’s like you are hiring additional employees and opening more billing counters in your store. This approach is more suitable for large applications with a lot of users and a lot of data processing needs. There is virtually no limit on how many servers you can add to your setup. So your application can handle as much load as possible. When your setup has a lot of servers, you can use a “Load Balancer” to evenly distribute the incoming traffic to your servers (Remember, in our story of the grocery store, we hired a 4th employee whose job was to direct the customers to the different billing counters)

This also solves the problem — when 1 server is down, the other server can still keep functioning and your application does not become “unavailable”. This concept is also known as “High Availability”.

Database Replication

In the above system design (Yes….you can call it system design or software architecture), we have taken care of web servers. What about database servers? We can scale-out database servers also. This is called database replication. The idea is simple. You have a master database and several copies of this master database (known as slaves). All inserts/updates/deletes will be done on master DB while slaves will be used as read-only data stores. This ensures the high availability of your database as well.

Image by Author

In simple terms, if you think about scalability or when you say that your software application or system design or software architecture is scalable, it simply means that your system can handle the load or requests in acceptable limits of response times.

Horizontal scaling of web servers and database servers is an excellent way of achieving it. But is there anything else we can do to improve load/response time?

Caching

When your application is being used, essentially there are requests to the database which are processed and served to the user. Some of such requests can be the same and repetitive. It will make sense if we store the data corresponding to these requests in temporary storage which is faster than a database. “Cache” is one such storage. You can add Cache to your application architecture. On receiving a request, the webserver will first check if the data is available in the cache. If yes, it will grab it from the cache and send it to the client. Else, it will query the database, store the data in the cache and send it to the client.

I hope this article gave you an approach to how to think about scalability. There are many other and complementary techniques to design and build scalable systems. I will try to cover them in subsequent articles.

ATP Tennis Cluster Analysis

2021년 2월 20일 by hyungseok

ATP Tennis Cluster Analysis

Using cluster analysis to segment tennis playing styles

Derek Austin

1 day ago·7 min read

Photo by Ryan Searle on Unsplash

In recent years, almost all sports have been part of the analytics revolution. Thrust into the public eye with ‘Moneyball’, the story of the Oakland A’s GM, Billy Beane, who pioneered a math approach to scouting, analytics has spread far and wide to all kinds of sports: basketball with the 3 point revolution, football with events like the “Big Data Bowl,” run by Michael Lopez, and even to the Premier League where Liverpool assisted by being the league leader in analytics took home the league championship.¹ Baseball was a natural starting point for the movement, as it is characterized by 1 vs 1 matchups, a pitcher vs a batter, making it is easier to quantify individual value and decouple positive and negative affects of one’s teammates. One would think that similar innovations have taken place in tennis, another game characterized by 1 vs 1 interactions, however, tennis has lagged far behind other sports in analytics. The only advanced analytic measures tennis fans have been exposed to recently are IBM Watson’s ‘keys to the match’ which highlight, the most important statistic for each player in order to secure a victory, likely from a tree based approach. The progress of tennis analytics in the public domain can be entirely credited to Jeff Sackman, the Bill James of tennis analytics.

Sackman has been tirelessly collecting match statistics, charting matches using a custom coding procedure and releasing data sets on GitHub and his site Tennis Abstract for years.² Furthermore, when Novak Djokovic added Craig O’Shaughnessy to his team, a strategy coach who preaches the value of data in the sport, he gave a symbolic boost to those who have pursued tennis analytic passion projects in the past. Like other games that involve a 1 on 1 matchup, ELO scores have been used to find relative strength of a player.³ However, As many who have played tennis competitively know, tennis is a game uniquely influenced by matchups. Having played tennis in college as a six-foot-seven big server, I loathed coming up against a smaller ‘grinder’ who would stand feet behind the baseline and put as many balls in play without going for winners. I hypothesized that with Sackman’s dataset and K-Means cluster analysis I would be able to find patterns of the different styles of play that characterize tennis, and ultimately conclude which clusters had advantages over their counterparts.

Sackman’s basic ‘box score’ dataset provides a singular row for each match played, dating back as far back as 1968. I chose to analyze files from 2011 on as a relatively arbitrary starting point, but also in order to keep the analysis relevant as the game has shifted significantly in the last 20 years. The stats provide a basic look into each match with metrics such as aces, 1st serves in, double faults, etc. The data set does not provide any rally metrics such as rally length, if the point was won on a winner, forced or unforced error, or whether it was won at the net. However, basic stats such as first serve percentage, percentage of service points won, percentage of return points won give insights into the playing style and relative strengths of each player. After loading the data into Python, I dropped the rows that had null values for any relevant statistics. Next, I created two rows for each match. The first row comprised of the winner’s stats with a unique id that Sackman provides for the winning player, and the second row would follow the same process but for the loser. This step was necessary for two reasons, first the stats are organized by winner and loser (ie. w_ace is the column for winners aces and l_ace is for losers aces) so in order to derive stats for each player, I had to create separate mappings to correspond to their stats for the match regardless of outcome. Secondly, I had to sort by the date and ID for each player in order to compute running totals that would be then used to calculate statistics such as first serve percentage after each match. To give you a feel for the data, below is a screenshot of Djokovic’s career serving statistics following his last available match against Dominic Thiem in the ATP Tour Finals.

I computed the same statistics for each player’s return games, as well as general stats like percent of all points won, and points per minute, as I thought it would likely be correlated with rally length (which, as mentioned above, was not in the data). Surprisingly, even players like Novak Djokovic only win 55% of the points which indicates a relatively low spread between the best players and average players who would win by definition 50% of their points. This would mean that a 1% improvement could be the difference in hundreds of thousands of dollars for many players.

Next I utilized the preprocessing library of Scikit-Learn in order to standardize the data and then fed it to the mini batch clustering functionality available within Scikit-Learn. I attempted different cluster sizes, ranging from 2–10 while looking for the ‘elbow’ in the graph (which seemed to be at 4 clusters). The ‘elbow’ method is a very subjective measure of optimal clusters but was sufficient for my analysis(if you need a tune-up on clustering check here).

The magic of clustering paid off yet again and I was able to find four distinct playing styles within the data. The first cluster was characterized by highest ace percentage, tallest individuals and highest first serve win probability. They won around 50% of their points and played the most amount on hard and grass courts which is to be expected. The next cluster seemed to group mediocre players together. By percentage, they won the least amount of points, spread their matches across all surfaces somewhat evenly and won the least amount of matches (38%). While this was the largest group, players were averaging the least amount of matches in the data set, which likely means these were players back and forth between the challenger tour and the pro tour, who were struggling to ever make it big. If I chose to increase the cluster size I would guess that this group would get broken down at a more granular level. Next we have the ‘all-courter’ which is essentially another way in the tennis world of saying the best individual players. Players like Federer, Nadal, and Djokovic are likely in this group as the cluster collectively wins 53% of their points, give up the least amount of break point opportunities per game and win the most amount of points on their second serve. Finally, we have the clay court grinders. They play 37% of their matches on clay which is the highest by 5%, have the lowest first serve percentage and points won on their first serve, but make up for it by generating the most break points per game on their return. Here is a detailed chart of the various summary statistics for each cluster.

Summary Statistic for each cluster based on what I was able to capture

Next, I examined the various winning percentages by each cohort against each other in total as well as on the various surfaces.

Cluster X winning percentage against Cluster Y

Cluster 2, the ‘all-courters’ faired best against all involved and my fear of playing grinders was irrational as the big servers actually won 56% of their matches against their cluster 3 counterparts. What was also interesting was those in cluster 0, the ‘big-hitters’ had a better chance of upsetting cluster 2 than the other cohorts. Intuitively, this makes sense as we have seen players like John Isner or Kevin Anderson get ‘hot’ for a tournament where it seems their power is too much for anybody. On the other side of the spectrum grinders who usually cannot overpower opponents and have to rely on tactics thus are far more consistent in their results. Below are graphs corresponding to hard, clay, and grass with each groups relative advantage over the total win % (court specific win %-total win %) which also further illuminates the big servers expertise on hard and grass courts and the grinders proficiency on clay.

While this serves as a starting point, Jeff Sackman has also published point by point data from the grand slams dating to 2011 which will give further insight into rally metrics that can hopefully further separate the clusters to group by those more offensive and net minded vs those content to rally for 10+ shots from the baseline. Look for that in Part 2 in the upcoming weeks.

Schoenfeld, Bruce. How Data (and some Breathtaking Soccer) Brought Liverpool to the Cusp of Glory. https://www.nytimes.com/2019/05/22/magazine/soccer-data-liverpool.html
Sackman, Jeff. Gitbub Home page. https://github.com/JeffSackmann
Tennis Abstract Elo Ratings. http://tennisabstract.com/reports/atp_elo_ratings.html
Sackman, Jeff. Measuring the Impact of Break Points. http://www.tennisabstract.com/blog/2019/01/04/measuring-the-impact-of-break-points/

Superhuman AI and the future of democracy and government

2021년 7월 11일2021년 2월 19일 by hyungseok

Podcast

Superhuman AI and the Future of Democracy and Government

Ben Garfinkel explores what we can — and can’t — predict about the future of humanity

Jeremie Harris

1 day ago·53 min read

To select chapters, visit the Youtube video here.

Editor’s note: This episode is part of our podcast series on emerging problems in data science and machine learning, hosted by Jeremie Harris. Apart from hosting the podcast, Jeremie helps run a data science mentorship startup called SharpestMinds. You can listen to the podcast below:

APPLE | GOOGLE | SPOTIFY | OTHERS

As we continue to develop more and more sophisticated AI systems, an increasing number of economists, technologists and futurists have been trying to predict what the likely end point of all this progress might be. Will human beings be irrelevant? Will we offload all of our decisions — from what we want to do with our spare time, to how we govern societies — to machines? And what is the emergence of highly capable and highly general AI systems mean for the future of democracy and governance?

These questions are impossible to answer completely and directly, but it may be possible to get some hints by taking a long-term view at the history of human technological development. That’s a strategy that my guest, Ben Garfinkel, is applying in his research on the future of AI. Ben is a physicist and mathematician who now does research on forecasting risks from emerging technologies at Oxford’s Future of Humanity Institute.

Apart from his research on forecasting the future impact of technologies like AI, Ben has also spent time exploring some classic arguments for AI risk, many of which he disagrees with. Since we’ve had a number of guests on the podcast who do take these risks seriously, I thought it would be worth speaking to Ben about his views as well, and I’m very glad I did.

Here were some of my favourite take-homes from our conversation:

Unsurprisingly, predicting the future is hard. But one of the things that makes it especially hard when it comes to artificial intelligence and its likely impact on the economy, is that AI seems likely to challenge many of the assumptions that are baked into our standard economic models. For example, the very idea that markets consist of people who’ve made money and are looking to spend it on products may not generalize to a world where most buying and selling decisions are being made by machines. Likewise, we currently assume that there’s a pretty clear distinction between labour (the work that people put in to build stuff and deliver services) and capital (the tools, equipment and stuff that they build, or use to build other stuff). It’s not clear which of out economic intuitions will generalize to a world where AI systems count as capital, but are also doing most of our labour.
One active debate among economists, historians and futurists is whether the growth and development of the global human economy has been smooth and gradual, or step-wise and sharp. For example, some point to the Industrial Revolution, the Neolithic Revolution and other similar events as moments where economic development increased discretely and abruptly, whereas others see these as merely the moment that a level of ambient, continuous development finally became noticeable. Interestingly, people’s views on the relative smoothness or sharpness of human economic history plays an important role in the way they imagine the transition to an AI economy. If you generally think that economic growth has always been continuous and gradual, you’re less likely to think that AI will lead to a discontinuous, transformative leap in our day-to-day lives over a short period of time.
Ben is skeptical of certain “classic” arguments for AI risk. While not dismissing them completely, he argues out that many of them are unnecessarily abstract. He also makes the case that the emergence of increasingly systems like OpenAI’s GPT-3 have given us the opportunity to see how concrete and somewhat general AI systems behave in practice, and the results, he argues, suggest that concerns around AI risk from recursively self-improving systems may not be on particularly solid ground. It’s *really* hard to unpack these arguments in bullet point form here, so if you’re interested in this aspect I really do recommend listening to the episode!

You can follow Ben on Twitter here (though he hasn’t tweeted yet :P) or follow me on Twitter here.

Links referenced during the podcast:

Ben’s page on the Future of Humanity Institute’s website.

Chapters:

0:00 Intro
1:21 Ben’s background
3:14 The risk of AI
9:57 The value of money
13:38 AI as a participatory phenomenon
16:01 AI and GDP
27:11 Evolution of life
30:36 The AI risk argument
45:23 Building these systems
51:29 Feedback of human self-improvement
53:54 A shift in ideas
1:07:38 Wrap-up

Please find below the transcript:

Jeremie (00:00:00):
Hey, everyone, Jeremie here. Welcome back to the Towards Data Science Podcast. I’m really excited about today’s episode because we’re going to be taking on a lot of long-termist, reformed looking, and semi futuristic topics related to AI. And the way AI technology is going to shape the future of governance. Are human beings going to just become economically irrelevant? How many of our day-to-day decisions are going to be offloaded to machines? And maybe most importantly, what does the emergence of highly capable and highly general AI systems mean for the future of democracy and governance itself? Those questions are impossible to answer with any kind of certainty, but it might be possible to get some hints by taking a long view at the history of human technological development.

Jeremie (00:00:41):
And that’s exactly the strategy that my guest Ben Garfinkel is applying in his research on the future of AI. Now, Ben is a multidisciplinary researcher who’s working on forecasting risks from advanced technologies, including AI at Oxford’s Future of Humanity Institute. Ben’s also spent a lot of time exploring some classical arguments for AI risk, many of which you’ll have encountered on the podcast. We’ve had a lot of guests on to discuss and explore those in detail and many of which he disagrees with. And we’ll be exploring his disagreements, why he has them, and where he thinks the arguments for AI risk are a little bit shaky. I really enjoyed the conversation. I hope you do too. Ben, thanks so much for joining me for the podcast.

Ben (00:01:19):
Yeah. Thanks so much for having me.

Jeremie (00:01:21):
I’m really happy to have you here. Your focus is on a whole bunch of long-termist issues, a lot of them around AI. Before we dive into the meat and potatoes of that though, I’d love to have a better understanding of what brought you to this space. So what was your background coming in and how did you discover long-termism in AI?

Ben (00:01:38):
Yeah, so it’s actually I guess, fairly roundabout. So in college I studied physics and philosophy and was quite interested in actually the philosophy of physics and was even considering going to grad school for that, which fortunately I did not do. And yeah, I guess through philosophy, I started to learn more about ethics and encountered certain ideas around population ethics. The idea that there’s different questions around how we should value future generations in the decisions we make and what our obligations are to future generations. Or how strong the obligation is to do something that has at least some use to other people. And then through that, I became increasingly interested in long-termism, and also trying to figure out something that seemed useful. And I came to think that maybe philosophy and physics was not that.

Ben (00:02:28):
And I got actually very lucky not just around this time, as I was trying to look more into long-termist or futuristic topics, I happened to meet a professor, Allan Dafoe, who was at Yale at the time. He was just himself pivoting to work on AI governance issues. And I think he put up a call for research assistants when I was still a senior there. And I was interested in the topic, I’d read a little bit about AI risk. I started to read for example, the book Superintelligence and I hadn’t really engaged in that area, but seemed like there may be some important issues there. And an opportunity jumped up and I started working with Allan. And now several years later, I’m actually still working with Allan, and I’ve just become fairly convinced that working on risks from emerging technology is at least a pretty good thing to do from a long-termist perspective.

Jeremie (00:03:14):
And this is actually a beautiful segue into, I think one of the main topics I really wanted to talk about. And that is this idea that you spent a lot of time thinking about existential risk from AI and the arguments for it. Many of which I know that you’re not actually fully sold on. Maybe we can start there, what’s the nature of the existential risk that people generally in particular, Allan and you are worried about when it comes to AI? And then we can maybe get into the counter-arguments to those arguments as well, but just for starters, what is that risk?

Ben (00:03:44):
Yeah, so I don’t think that there’s really a single risk that’s, at least really predominant in the community of people thinking about the long-term impacts of AI. So I’d say there’s a few main, very broad, and somewhat nebulous categories. So one class of risks very quickly is I’d say are risks from instability. So a lot of people, especially in the international security domain are worried about for example, lethal autonomous weapons systems, maybe increasing the risk of conflict between states. Maybe accidental, flash conflicts or potentially certain applications of AI, let’s say moving second strike capabilities and increasing the risk of nuclear war. Or they’re worried about great power competition. And the main vector of concern they have is maybe something about AI will destabilize politics either domestically or internationally, and then maybe there’ll be war which will have lasting damage or just some other negative, long conflict.

Ben (00:04:43):
There’s another class of concerns that is less focused on there being, let’s say some specific conflict or collapse or war. And is more focused on the idea that maybe there’s some level of possible contingency in how AI reshapes society. So you might think that certain decisions people make about how the government and AI will have lasting effects that carry forward and affect future generations. And in fact, for example, things like how prevalent democracy is or what the distribution of power is, or just various other things that people care about that maybe for example, bad values being in some sense entrenched.

Jeremie (00:05:23):
Because on that side, I imagine that’s a very, obviously it’s complicated area. But what are some of the ways in which people imagine AI transforming the extent to which let’s say democracy is a attractable mode of governance in the future?

Ben (00:05:36):
So just on democracy there’s obviously some speculative edge to this, but one argument for being worried about democracy is that democracy is not really normal. If you look across broad, sweeping view in history, back to the first civilizations, it’s not that uncommon for there to be, let’s say very weekly democratic elements. So it’s not completely autocracy, there’s some sort of body, say, Roman Senate or something, but in the case of Roman which is a well-known one. But it’s very far from what we have right now, which is like almost universal suffrage in a large number of countries with very responsive governments and consistent transfer for more. That’s extremely rare from a historical perspective. And even if things were not fully autocracy or somewhat coming before, this is a very different thing the past couple 100 years. And there’s different theories about why this modern form of democracy has become more common. And there’s a lot of debate about this because it’s hard to run RCTs. But a lot of people do point to at least certain economic changes that happen around the industrial revolution as relevant.

Ben (00:06:43):
So one class of change that people sometimes bring up is Androform, was a really serious concern before the industrial revolution. Some of the concern was that if you give a lot of common people power of the government, or leverage the [inaudible 00:06:56] that redistribute land, which is the primary form of wealth from wealthy actors more broadly should be very disruptive. And that’s as countries industrialized and land became less relevant as a form of wealth, maybe these land reform concerns became less of a blocker. You no longer had this land aristocracy, had this very blunt policy fear.

Ben (00:07:18):
And other concerns as well, is that the value of labor went up as well, just as productive increased. And this gave people in some nebulous sense, more bargaining power because you have the typical worker just what they did and more value. And they could create a larger threat by threatening to basically just remove their labor. Or organizations also thought to maybe have been relevant, like maybe people being packed in the cities would be easier to organize and actually have successful revolutions. And there’s lot of different factors that people basically point to as being economic changes that maybe helped democracy along its way or helps at least partly explain why it’s more prevalent today.

Ben (00:07:52):
So one concern you could have quite broadly is if the prevalence of democracy is in some way contingent on certain material or economic factors. Then that have only really held for the past couple 100 years. Maybe this isn’t normal, maybe if you just change a lot of economic and technological variables, it’s not going to hold. And there’s some more specific arguments here. So one pretty specific argument is just if the value of human labor goes very low, even goes to zero in most cases, because you can just substitute capital for labor. Because AI systems can do anything that people can do, maybe when we reduce the power of workers, if you can automate law enforcement or putting down the uprisings because military technologies can be automated as well.

Ben (00:08:33):
Maybe that makes authoritarian governments more stable. It means that they don’t even make concessions out of fear of uprisings. Maybe as well if the value of labor goes to zero, then at that point might become very heavily based on just who owns capital or who owns machines basically. And maybe it creates a system, a situation that’s very analogous to the little concerns about land reform. Where wealth wasn’t really based on these more nebulous things would divide people’s labor, didn’t really play a role, which is largely, there’s a thing that you own that you basically collect rents on. If you returned to that system, then maybe that’s also not good for the stability of democracy as well.

Ben (00:09:09):
So there’s an outside view perspective, which is just, this is a rare thing. Maybe we shouldn’t expect it to last, we change a lot. And then there’s some more inside view arguments that maybe will make authoritarian governments more stable, and make people were more worried about giving power to [inaudible 00:09:24].

Jeremie (00:09:24):
It’s really interesting how entangled all these issues are and how difficult it is to articulate a coherent vision of what the future might look like when all these transformational changes happen. One of the things that keeps coming to mind for me when we start talking about what’s going to happen with democracy, what’s going to happen with the economies. And then the power of labor to negotiate and so on, is the underlying assumption that we have any kind of market structure whatsoever, to the extent that you have all labor being done by machines.

Jeremie (00:09:57):
One of the, I guess almost silly questions that I would have is what is the value of money in that context? What is the value of price discovery? How does price discovery happen in that context? And what even does redistribution mean if… It’s not that we’re necessarily in a post scarcity situation, you would expect gradients of scarcity. But anyway, I’m not even sure what thought I’m trying to articulate here, but it looks like you have something to throw in there.

Ben (00:10:23):
So I think this is a really serious issue. I think we should not expect ourselves to actually be able to imagine a future with very advanced AI in any level of detail and actually be right. So an analogy I’ve sometimes used is I think there are certain aspects of a world where AI systems can at least do all the things that people can do. We can reason about to some extent, abstractly. We do have these economic models, we have labor and you have capital, and you can ask about what happens if you can substitute capital for labor. And even project is very abstract point of view. And there’s maybe some reason to hope that these theories are sufficiently abstract, that even if we don’t know the details. There’s still some reason to think that there’s sufficient general abstract that we can still use them to reason about the future. But there’s definitely a concern, like anything that becomes specific on how the governments work. We’re probably going to be imagining just the functionality of government’s quite wrong.

Ben (00:11:19):
So one analogy I’ve sometimes used is let’s imagine that you’re in say 1500 and someone describes the internet to you in very abstract terms of it’s like communication will be much faster. Retrieving information and learning things would be much quicker. And it gives you some of the abstract properties of it. There are some stuff you can probably reason about.

Ben (00:11:40):
So you might think, for example, “Oh, you can probably have less autonomy because people can communicate with them more quickly as opposed to them being overseas and out of contact. Or businesses can probably be larger because these coordination costs will probably go down.” And some stuff you can probably say about that would actually be true, or you could say, “Oh, maybe people work remotely,” and you probably don’t even know a lot about the details. But if you try to get really specific about what’s going on with that you’re probably going to be imagining it just completely, completely wrong. Because you have no familiarity whatsoever of what a computer actually is like, or how people interact with them.

Ben (00:12:15):
You’re not going to get details at the level of like, there’ll be this thing called Reddit and GameStop stock. There’s all these issues, which there’s no chance you’re ever going to foresee in any level of detail. And there’s lots of issues you might imagine that just won’t really apply because you’re using abstractions that somehow don’t fit very well. So this is a bit of a long-winded way of saying, I do think we have some theories and methods of reasoning that are sufficiently abstract and I expect them to hold at least a little bit. But I think there’s lots of stuff that we just can’t foresee. Lots of issues that we just can’t really talk about. And lots of stuff we say today they’ll probably end up being silly from the perspective of the future.

Jeremie (00:12:51):
Yeah, I would imagine so, “This time it’s going to be different,” is a dangerous thing to say at any given time. But when it comes to the next stage of the AI revolution, if you want to call it that. I know that’s the language you’ve tended to use as well and it seems apt in this case. One of the things that I do wonder about is a kind of almost like abstraction leakage where the abstractions that we rely on to define things like markets. This is one of the very fundamental elements of our reasoning when we’re talking about predicting the future. Markets implicitly revolve around people, because ultimately prices are just what individual human beings are willing to pay for a thing. To the extent that we broaden our definition of what a market participant could be.

Jeremie (00:13:38):
And here we get into questions of like, how do we consider an AI agent? At what point is it a participatory member of society? And at what point does price discovery really revolve around the needs and wants of non-human systems and things like that? I guess that’s where I start to wonder, this is a non-constructive perspective by default. So it’s not helpful for me to say like, “Markets are a bad abstraction,” but is that an issue that you think is serious or?

Ben (00:14:06):
Yeah, so yes, I do certainly think that there’s an issue and I think you point out a good, specific problem of, we have this very firm distinction between… People are very different than machines and software at the moment. It’s a very [crosstalk 00:14:19] like economic actors versus stuff about the economic [inaudible 00:14:23]. And there’s some degree of blurring of a corporation, for certain purposes has [inaudible 00:14:29] which are in some ways similar to a person. But the distinction is fairly, fairly strong. Even just between capital and labor, there aren’t any ambiguities around this at the moment.

Ben (00:14:41):
But if you think that very broadly capable, gentle, AI systems will exist in the future. We think that maybe people have interesting relationships with the AI systems where they create assessments, which are meant to pursue their values. I think a lot of distinctions that we draw might actually become a lot more ambiguous than they are today. And the way in which they become ambiguous in the future might make it so that any reason we do that relies on really crisp distinctions, might just fail in ways which are difficult to foresee at the moment.

Jeremie (00:15:12):
Yeah. It’s an interesting risk to predict because it really is unpredictable and fundamentally challenging. It seems like one of the issues there too, and you explore this in some of your work actually on the history of technology is which metric you’re even going to look at to tell the story of the evolution of this technology. Can you speak a little bit to that, your historical outlook and which metrics you find interesting and why they may or may not be relevant in the future?

Ben (00:15:36):
Yeah. So I think one metric that I think people very frequently reach to is global world product or GDP. And GDP is interesting as a metric because the thing it’s meant to measure is basically to some extent productive capacity, like how much stuff can you produce or stuff that people value can you produce. And-

Jeremie (00:16:01):
I have a stupid question. So what is GDP? What is the actual definition of GDP?

Ben (00:16:08):
So at least nominal GDP, you add up the total price of all of what are called final products that are sold within an economy. So a final product is basically something that is something like an end result. If you sell someone screws, and then they sell the screw to someone who uses the screw to make like a ceiling fan or something. The screw isn’t meant to be counted because you’re double counting. If someone buys a ceiling fan and they buy the screw when they buy the ceiling fan, they’re also buying the screw as well. So it’s meant to be basically adding up the total essentially sell price of all the stuff that’s bought or sold within an economy excluding the intermediate products.

Ben (00:16:48):
But then people also often want to talk about real GDP, which is different than nominal GDP. So nominal GDP is just, you add up basically all the prices. And one issue of nominal GDP is if you have inflation, then you can have nominal GDP increase for reasons that have nothing to do whatsoever with the actual underlying stock. So government decides to print more money, suddenly the price of everything goes up by a factor of 1,000, but you still have the same stuff. It doesn’t really feel like GDP growth has been extremely rapid in a nominal sense, but it’s not really telling you that actually you’re producing more stuff.

Jeremie (00:17:25):
Yeah. Venezuela is doing great.

Ben (00:17:27):
Yeah, exactly. So real GDP, it’s meant to be adjusting for this. And at least very roughly speaking the way it works is you try to define everything relative to the prices that existed a certain point in time in the past. So let’s say you have an economy that exists, the only product sold is butter and the price of butter goes up by a factor of 1,000 for some reason because of inflation. But you only double the amount of butter that you sell in the economy. Real GDP will just say, “Oh, because the amount of butter you sold increased by a factor of two. The size of your economy has only increased by a factor of two.” And the size of the economy is defined as take the price of butter in the past, multiply it by how many units exist today and that’s real GDP. And it gets pretty complicated because people keep introducing new products over time. So how do you compare the real GDP of the economy in 2020 versus the economy in the 1700s, given that most of the stuff that people buy in 2020 didn’t exist in 1700? How do you actually do that comparison? And there’s various wonky methods people use, they don’t really understand properly.

Ben (00:18:36):
But in asking that question, you’ve also gotten to one of the main issues with GDP. It’s meant to be tracking the productive capacity of society, like how much stuff we make basically. And if you use real GDP, over short periods of time, it seems fairly unproblematic because you’re not typically introducing that many new products. But over a long period of time, it becomes increasingly nebulous how these comparisons actually work. So very blunt comparisons still are pretty much fine. So you can still say like GDP per capita in 10,000 BC versus today. Even if I don’t know exactly how to define GDP per capita for like 100 other societies, I’m still quite confident it was lower.

Ben (00:19:21):
So it’s in some sense like a blunt instrument, I think its usefulness really depends on how precise you want to make your discussions or predictions. So let’s say someone makes a very bold prediction that the rate of GDP per capita growth will increase by a factor of 10 due to automation. If someone makes a bold prediction like that, it is a little bit ambiguous what real GDP means in some crazy futuristic economy. But even if you look a little bit fuzzy on it, the difference between GDP, the rate of growth, didn’t change, the rate of growth increased by a factor of 10 is still blunt enough. It’s a useful way of expressing a claim.

Ben (00:19:57):
So that’s a long-winded way of saying, I think GDP or GDP per capita is often pretty good as a proxy of just how quickly is productive capacity increasing. It’s useful for things like the industrial revolution, really clearly shows up in GDP per capita. Or when a country seems really stagnant, like undeveloped country isn’t developing, GDP per capita is typically pretty flat. And then when China, for example, started to take off in a really obvious qualitative sense GDP per capita tracked that pretty well. So it’s useful for that, but it also has various issues. And then there are also issues beyond that of like, often people want to use it as a proxy for how good people’s lives are, like GDP per capita.

Ben (00:20:38):
But there’s various things that don’t typically get factored into it, like the quality of medical care, isn’t very directly factored into it, air pollution isn’t factored into it. If everyone was just very depressed or anesthesia, the value of anesthesia being developed just really does not show up. There’s a classic paper by William Nordhaus that shows that quality improvements in lights, the fact that light bulbs are just way better than candles, more than 100 years ago doesn’t really show up. So it’s a long-winded way of saying same fiscal lots of issues, at least as a crude measure, pretty good. But doesn’t necessarily correlate that actually as well as you might help with wellbeing and other things of interest.

Jeremie (00:21:15):
It is interesting that when you tagged on that last piece, it doesn’t correlate well with wellbeing. I can’t think of a better encapsulation of a kind of alignment problem. Basically the problem of coming up with a metric that says, here’s what we want. Humans are really bad, or it’s not that we’re bad. It may just be a genuinely difficult problem to specify metrics that even make sense. And you see what the stock market, we decide to fixate on this one metric. And for a while, the stock market was a great measure of in general, how’s the economy doing, how’s the average person doing? But then there’s a decoupling and we end up with very divergent stock markets versus the lives of the average person. Anyway, sorry. It didn’t mean to butt in but you were mentioning the-

Ben (00:22:00):
Yeah. So I should just say as a little caveat, I think at the moment, GDP actually is pretty good as a metric. Where if you often define the things you care about, like life expectancy or life satisfaction. It does actually currently, there’s often like a pretty strong correlation. And I think you’re just like, didn’t know anything, you’re behind [inaudible 00:22:17] or something, you need to pick a country to live in. And the only thing you get is the GDP per capita. This is often going to be useful information for you. I guess my thought is more in line with the alignment concerns, I wouldn’t be surprised if it becomes more decoupled in the future.

Ben (00:22:30):
Especially if let’s say, imagine we eventually just totally replaced labor with capital and machines and people no longer are really working for wages. And economic growth is mostly machines building other machines and workers aren’t really involved. I would not be shocked if the economy increases by a factor of 10, but a person’s life does not increase by a factor of 10.

Jeremie (00:22:47):
Yeah. That’s interesting as well and raises the question of what, and this is back to price discovery, which is a big aspect of GDP. There are so many areas where things get complicated. But what’s also interesting is looking at some of the work that you put together on this historical exploration of technology. A lot of these metrics really are correlated. To some degree, it just doesn’t matter what you’re measuring, something dramatic has happened over the last 2000 years or the last 20,000 years. However, you want to measure it, either cultural revolution, neolithic revolution, industrial revolution. And it’s almost as if the human super-organism, all the human beings on planet earth are an optimization algorithm that’s just lashed onto some kind of optimum or local optimum or whatever. And we’re now climbing that gradient really steeply.

Jeremie (00:23:44):
Do you see AI as like a continuum limit of that? Is that just like the natural next step? Or should we think of it as a quantum leap, like a step function, things are just qualitatively different?

Ben (00:23:56):
Yeah. I think that’s a really good question. And I do think that this is a debate that exists in terms of how exactly to interpret the history of economic growth or increased social capacity. Or whatever kind of nebulous term you want to use to describe people’s ability to make stuff or change stuff or get stuff done in the world. And there’s actually a debate that exists for example, between different interpretations of the industrial revolution. So one interpretation of the industry revolution which occurred between roughly 1750 to 1850, in the UK and some surrounding countries is that up until the industrial revolution, growth was very stagnant. And then there was some change, some interesting pivot that happened that maybe took place over, maybe also another century on the other end of the industrial revolution. Where for some reason the pace of technological progress went up.

Ben (00:24:55):
And people switched away from an agriculturally based economy to industrial economy. And people started using non-organic sources of energy. So it’s no longer wood and animal fertilizer. It’s now fossil fuels and energy transmitted by electricity and stuff like this. And R&D is now playing a role in economic growth, which previously it didn’t really. And there’s some interesting phase transition or something that happened over a couple 100 years. We just transitioned from one economy to almost like a qualitatively different economy that could just grow and change faster.

Ben (00:25:29):
There’s another interpretation now that basically says that there’s actually this long run trend across at least the history of human civilization of the rate of growth getting faster and faster. And this interpretation says that as the overall scale of the economy increases, for that reason, the growth rate itself, just growth keeps going up and up. And this interesting feedback loop where the scale of the economy kept getting bigger and growth rate kept getting larger and larger and really visibly exploded in the industrial revolution. Just because this is where the pace finally became fast enough for people to notice this, but there was actually like a pretty consistent trend. It wasn’t really a phase transition.

Ben (00:26:12):
And there’s some recent work by for example, David Roodman, who’s an economist who does work for the Open Philanthropy Project. There’s a recent report he wrote I think, modeling the human trajectory, which argues or explores this continuous perspective. And there’s a debate in economic history as well. So there’s an economist, Michael Kramer has argued for this smooth acceleration perspective and lots of economic historians who have argued. Actually, there’s some weird thing where you switch from one economy to another.

Ben (00:26:42):
I’ll just say that there’s competing interpretations. So one just says every once in a while, it’s a bit weird, it’s a bit idiosyncratic. Something happens, some change that’s a bit discontinuous. And we switched to a new economy that can grow faster. And another interpretation says, no, actually this is a pretty consistent forest. Just things keep getting faster and faster, and it’s not phase transitions and it’s not discontinuity, it’s just there’s a smooth, really long run trend of just the world keeps accelerating more and more

Jeremie (00:27:11):
It’s interesting how that entangles almost like two different sub-problems. One of them is do humans learn almost continuously? In other words, is it the case that cave people were gradually generation on generation actually picking up more and more skills as they went, that it only become obvious when you look over like 10,000 years. Or is it the case that no, they’re basically stagnant, everything is truly flat and then you get some takeoff. It almost feels like this could be viewed as part of an even deeper question where if you keep zooming out and keeps zooming out. It no longer becomes a story of humanity iterating towards some future economy with AI is taking over. But rather moving from completely a biotic matter and the big bang, purely no value creation whatsoever to…

Jeremie (00:28:01):
I guess that has to be a step function, that first moment where life evolves. This is where I’m curious about, that perspective would seem to argue for more the quantum leap angle or the step function approach, unless I’m mistaken.

Ben (00:28:15):
Yeah. So I think that’s right. Definitely at least intuitively there’s certain transitions in history where it really seems like just something different happening. So the first self-replicating thing that can qualify as a life form, it seems like that has to be like a fairly discrete boundary in some sense. Or those things like, I really don’t know evolutionary history, but I think first you carry out something like mitochondria became part of the cell. This is a fairly discrete event, I believe where one of the organisms were smaller than the other, [inaudible 00:28:48] in it and the whole eukaryotic branch of life evolved from that. And various interesting things like people falling from that, where that also seems like something that intuitively is a discontinuous change that I don’t exactly know.

Ben (00:29:06):
So it does seem like intuitively there are certain things. And then another one as well is even in the industrial revolution where people were starting to do agriculture in a big way. I think the general thinking is that this was actually fairly rapid in a historical sense or things that could qualify as humans have existed for 10s of 1,000s of years. And then maybe over the course of a few 1,000 years people in like Western Asia and later other continents, transitioned to sedentary agricultural civilizations.

Ben (00:29:35):
And I think the thought is you had like a massive ice age for 100,000 years roughly, and then the ice age ended. And the climate changed and it became in some ways more favorable for people actually transitioning to sedentary agriculture, and then it just happened very, fairly quickly. So yeah, I do think that you’re right that there are some historical cases where it really does feel like at least without me personally knowing a lot about them, it feels like a discontinuous change. And I do also think that will probably be the case to some extent for AI. I don’t think it’s going to be a, you wake up tomorrow thing. But I do think that if we eventually reach full automation or if the growth rate again increases due to AI. People probably won’t look at it just as a stable continuation of economic trends that have existed since 1950. That right now we have this very steady rate of economic growth and we have this pretty steady filling rate of automation. And if the growth rate ever goes nuts, I think that people will feel like there was some inflection point or pivot point or some tipping point involved there.

Jeremie (00:30:36):
That’s actually as good a transition point as any of that could imagine to the second area you’ve been looking at that I really want to discuss, which is your views on AI safety… Not AI safety necessarily, let’s say AI risk and this idea of a smooth transition to an AI powered world, or let’s say a very abrupt transition to a kind of dystopic or existentially deadly scenario. So do you have some views on this? Maybe I’m just going to kick things off with that. So can you lay your thoughts are on where you think the AI risk argument is strong and maybe where it fails?

Ben (00:31:14):
Yeah. So I think I might just initially say a little bit about the continuity question or at least the relevance to the continuity question. So as you alluded to, this is also the debate people have about AI is how abrupt will the… Let’s assume we eventually get to a world where AI systems can basically make human labor obsolete and do all sorts of other crazy things. How abrupt transition will that be? Will it be the sort of thing, like an analogy to the industry revolution, where it’s a period of many decades and it’s this gradual thing that spreads across the world in a gradual way?

Ben (00:31:48):
I think even things like, steam power, people transitioning from not using fossil fuels to using them, that was an extremely long transition. Will it be more like those cases or will it be something that feels a lot more abrupt? Will there, for example be a point like a two-year period, where we went from stuff being basically normal to now everything is AI or even less than two years. And this is the debate that sometimes happens in the long-termist or futuristic [inaudible 00:32:15]. And it seems relevant in some ways where one of, and some ways should be something that increases risk or eventually reduces it.

Ben (00:32:24):
So in terms of increasing risk, one thing that a sudden, or really rapid change implies is it can come a little bit out of nowhere. So it’s very continuous, you see a lot of stuff that’s happening coming well ahead of time. Whereas if it’s really sudden, if it’s a process that would take two years, and that means that in principle two years from now, we could be living in a very different world if it just happens to happens. And there’s less time to get prepared and less time to get used to different intermediate levels of difference and do trial and error learning and get a sense of what the risks are. What the risks aren’t. If we talk this out and realize opportunity to see how to find and get used to the problems and come up with intermediate solutions and learn from your mistakes. And I think the largest risk with this is probably relevant to risks related to misaligned AI which is, I guess, the last major category of risk. And these are also a little bit diverse and I believe you’ve had some previous people on the podcast talk about them.

Ben (00:33:21):
But a lot of the concerns is basically boiled down to lots of AI systems we develop in the future will probably to some extent behave as though they’re pursuing certain objectives. Or trying to maximize certain things about the world. In the sense that like [inaudible 00:33:35] and the system makes predictions about offense rates in a criminal justice perspective is in a sense, trying to increase predictive accuracy or that sort of thing. And the concern is that the goal is that AI systems have will in some sense diverge and [inaudible 00:33:58] people tend to have, and that this will lead to disastrous outcome. We have AI systems which are quite clever and quite good at achieving whatever goals they have just doing things that differ from what people want.

Ben (00:34:12):
So speed is really relevant to this because if you think that this is going to be this pervasive issue of someone creates an AI system and deploys it. And then there’s some sort of divergence between it’s goals and goals that people have, and this causes harm. It seems like if there’s a really continuous transition to AI systems playing larger and larger roles in the world, that there’s probably quite a lot of time to know this less catastrophic versions of this concern or learn what works or doesn’t work. Not everyone is fully convinced that just gradualness and trial and error is enough to completely resolve the issue. But it seems like surely it’s helpful to actually be able to see more minor versions of the concern and come up with solutions that work in minor cases. Always this stuff is very sudden then, and let’s say we wake up tomorrow and we have AI systems that in principle can just completely replace human labor, could run governments, could do whatever.

Ben (00:34:59):
If we, for whatever reason, decide to use them. And they had goals which were different than ours in some important way, then this is probably a lot more concerning and we might not see issues coming. Yeah. So I guess to your question, what are the reasons why this might not be a major concern or just what’s the set of arguments for it being a concern one way or the other?

Jeremie (00:35:21):
Well, actually I think there’s an even more specific concern that you’ve taken a lot of time to unpack. And it’s this concern around the argument that Nick Bostrom makes in his book, Superintelligence. Just to briefly summarize, to tee it up here, the idea is, and I’m going to butcher this and please feel free to highlight the various ways in which I butcher this. But the idea is something like, if we assume that AI teams, let’s say OpenAI and DeepMind and whatever else are gradually iterating and iterating and iterating. One day, one of them has an insight or purchases, a whole bunch of compute, or gets access to a whole bunch of data. That’s just the one thing that’s needed to bump a system from like pathetic, little GPT-3 to now all of a sudden human level or above.

Jeremie (00:36:06):
That system because it’s human level or above, it may know how to improve itself because humans know how to improve the AI systems. So maybe it figures out how to improve itself and you get some recursive loop because loops very tight, the AI can improve itself. And eventually it’s so smart that it can overpower, let’s say its captors with its intelligence and take over the world and lead to a completely disastrous outcome. Is that at least roughly right?

Ben (00:36:30):
Yeah. So I think that’s basically roughly right. Yeah. So one way to think about is I think there’s a spectrum of these alignment concerns. And some of them are in the more, maybe the future nebulous perspective where we create lots of AI systems gradually over time and their goals are different from ours and there’s a gradual loss of control of the future and that sort of thing. And there’s so much more extreme where it’s like there’s a single AI system and arrives quite suddenly. And it’s in some sense broadly superintelligence and it doesn’t really have major precedents. And that system individually quite rapidly causes havoc in world, like there’s some major jump to this one single very disruptive system which is definitely the version of concern. It’s emphasized in things like Nick’s book Superintelligence and then the narrative, I guess you just described.

Ben (00:37:18):
So a lot of my own thinking about AI risk has been a lot about this more extreme end of the spectrum so that concern appears in places like superintelligence for a couple of reasons. One I think it’s the version of I first encountered and that made me especially interested in it which I guess is a partial just personal reason for interest.

Ben (00:37:39):
And the other others I think that this is just, even if lots of AI alignment researchers, don’t primarily have this version of concern in mind. I think it’s still quite influential and pretty well-known. And often if someone knows anything about AI risk, this is the version of concern that comes to mind. So that sounds I think it’s maybe a special worth paying attention. So some of my thinking has been just about the question of like it plausible that you actually have this very sudden jump from you don’t really have major AI systems of interest what is a bit like it is today. And then suddenly some researcher somewhere has this major breakthrough and you end up with this single system. And I guess I’m fairly skeptical of this for maybe boring reasons.

Ben (00:38:15):
So one initial boring reason is just that’s not the way technology tends to work. If you start from the perspective of like, let’s look at how technology normally transforms the world. It’s normally the case that it’s this a protractive process that takes decades where someone develops something and then it’s a long process of improvement. And then it’s the point in some sectors before other sectors and it’s useful in some areas before other areas. And then people need to develop complimentary inventions to take advantage of it. And people need to figure out how to actually use it appropriately. And there’s lots of tweaking and issues you don’t foresee that make it a slow process. So like electricity it’s, I think the electric motor sometime in the early 19th century, I believe it’s invented. But then electric motors don’t predominant in American factories until something the 1930s.

Ben (00:39:02):
Or the first digital computers middle of the 20th century but from like ’90s, that they really show up in productivity statistics in a big way. And even then, not really and still loads of countries, not like that pervasively used in different important contexts. And even not in a sense like that larger portion of the economy. So it becomes a start from there and it’s like you don’t look too specifically at the details of AI and say like, “What would I expect if it’s like any other technology we’ve ever had?” Probably it’s economic transformation, it’s going to be a gradual thing, lots of annoying stuff that happens.

Jeremie (00:39:35):
To just to probe at that a little bit. So one of the things that I would imagine has made the progress and distribution of technology accelerate in the last 100 years for whatever period we choose is precisely communication. We talked about that quite a few times, the role the internet played and so on. And communication in particular, in terms of tightening feedback loops between the teams of people who design product, the teams of people who deploy it, the teams of people who sell it and so on. To the extent that integration that coherence is driven by communication. Would that undermine this argument in a sense of saying, “Well, if you have a single AI system that’s internally coherent and that’s able to essentially tighten that feedback loop, not infinitely but to machine time.” Do you find that that position interesting, I guess, is what I’m trying to ask?

Ben (00:40:28):
So I guess I find it interesting, but not persuasive. So I’d say there’s the idea of like if we jump to imagine that there’s a sudden jump to some extremely broadly capable AI system that just can serve you all of the economically relevant production tasks. It can do mining for chips, it can run ballot polling centers, it can do AI research, it can build more compute resources, it can manage without military strategic. If we imagine that there’s a single system that just abruptly comes into existence, that’s just itself doing all of this without interacting with outside factors or pulling on external resources. It does seem like there’s some intuition of like, stuff can happen faster because the communication efficiency costs have just gone down a lot. But there’s the questions like, should we imagine that this is the way development will work? That there’ll be like one single system that just abruptly gets all these capabilities. And I guess that’s something that I’m probably skeptical of in the case of AI and also again for somewhat boring reasons.

Ben (00:41:32):
So we do know that you can have progress in different areas at the same time. So something like the… I imagine probably a lot of your listeners are familiar with this, language models or this recent system GPT-3 developed by OpenAI. This is an example of a system that got pretty good at lots of different tasks through a single training process roughly the same time. So I was trained on a large corpus of basically webpages. And I was trained to basically try and predict what’s the least surprising, next word I could encounter on the basis of the words I’ve already encountered in a document I’m exposed to.

Ben (00:42:08):
So you can use it to do stuff like write a headline for a news article, and then I’ll try and think what’s the least surprising text for an article given this headline. And one thing people find is you can actually use it to do a lot of different stuff. So you can use it to do translation, for example, we can write a sentence in Spanish and say the English translation, the sentence is blank is calling. And the system will go out at least surprising thing to find next would basically be like the English translation of it and use it to write poetry. What’s the least surprising ending to this Emily Dickinson poem and that sort of thing.

Ben (00:42:42):
But even in these cases where lots of different capabilities in some sense, come online at once. You do still definitely see AI variation in terms of how good it is at different stuff. So it’s pretty bad for the most part at writing, like usable computer code. You can do a little bit of this, but basically can’t do it in a useful way at the moment. It’s pretty good at writing like Jabberwocky style poems, one of these came before the other. And there’s reason to think that can even be the case that’s going to be like an expanding thing where some capabilities come before others. There’s also some capabilities that you can’t really produce just purely through this GPT-3 style, train it on this large corpus of online things.

Ben (00:43:23):
If you want to translate the Department of Defense internal memos, it needs to be trained on something else. If you want it to write like healthcare legislation, probably [inaudible 00:43:30] is not going to do it for you. If you want it to set supermarket prices, at price inventory thing, or you personalize emails where it knows actually when to schedule meetings for you. You’re going to need a different training method. Or if you want to perform better than humans, you’re going to need a different training method as well, because you need to give it like… What it basically does is to try to say what would be the least surprising thing for person to have written on the internet. But if you want to do better than a person you’re going to need to use something else, some sort of feedback mechanism.

Ben (00:43:55):
So basically the reason I think different capabilities will come online at different times. There’ll also probably be lots of annoying stuff that comes up in different specific domains that doesn’t really show up to researchers. But tends to come up when you want to apply stuff like a law of going from [inaudible 00:44:07] to people actually using electric motors in factories, it’s like, you need to redesign your factory floor. Because it’s no longer based around the central steam engine. You need to redesign the things that’s using the hardware, you need redesign the processes that your workers use is actually leverage this thing. You have regulations that need to happen, et cetera, et cetera. And probably these things would need to be dealt with to some extent, at least initially by different teams. And some of them will be harder than others or require different resources than others. And I would basically be surprised if, this has been like a long way of saying I expect stuff to come online, that’d actually be really useful in the world at pretty different points for different tasks.

Jeremie (00:44:40):
Interesting. Yeah, that makes perfect sense. And what’s interesting to me is it’s exactly the kind of error that a theorist would make, imagining a system that… And not that it is an error, this scenario could easily come to pass. But these are interesting objections that seem to map onto the psychology of somebody who’s focused on theoretical optimization rather than optimization of systems and economies in practice. Interesting. So none of this though, seems to suggest that it would not be possible at some point in the future for an AI system with the ability to self-improve iteratively and [crosstalk 00:45:21] to be developed.

Jeremie (00:45:23):
So there’s two parts to this question. First off, A, do you think that that’s the case, or do you think that it will be possible to build such a system? And B, do you think such a system will be built or is likely to be built? Is there a series of incentives that stacks up to get us to a recursively self-improving AI that just goes [foom 00:45:47], eventually and does whatever? Is that a plausible story?

Ben (00:45:51):
Yeah. So I have a couple of bits here. So first bit is it’s unclear to me that recursive self-improvement will really be the thing. So clearly there are feedback loops and will be feedback loops in the future. So we see lots of technologies in a more limited way. So the existence software is useful for developing software. Software developers use software and computers are useful for designing computers. If people like Nvidia or any sort of hardware manufacturer didn’t have computers to use, they would probably find their jobs quite a bit harder. So there’s loads of cases where technologies where the aided design development or a technology aided development for another technology. It’s typically not recursive, or it’s not typically exactly the same artifact that’s improving itself.

Ben (00:46:44):
And in the case of AI, I don’t necessarily see a good reason to expect it to be recursive. I definitely expect AI to be applied more and more in the context of AI development searching for the right architecture. Or learning, figuring out what’s the most optimal way to basically develop another system or make it work well. But I don’t necessarily see a strong reason to think that’s the individual system doing it to itself, as opposed to a system that’s developed to help train other systems. The same way like software doesn’t tend to improve itself. I don’t really see a great benefit to it being recursive. It could be the case if that’s done, but I don’t see why it would be recursive, why that’s inherently more attractive. In some ways it seems maybe less attractive. It seems like somehow messier or it seems nice if this is a bit of a modular thing.

Jeremie (00:47:33):
Yeah, I guess, to some degree, just to bolster this argument a little bit from an engineering standpoint, I would imagine that… So there’s this abstraction of different systems, this term that we use to say there’s system A, there’s system, B. System A’s either improving itself or system B’s improving it, and then maybe system A… All that stuff. I guess what I’d thinking of in this case is an abstraction that covers something like a closed system that crucially operates on machine time. So the key distinction to my mind that would define like a takeoff of this form would be the fact that this either self-optimization or system A improves system B happens on the order of like microseconds. Or what have you such that humans do not intercede in the process and are ultimately surprised by the results where the results would deviate significantly from our expectations.

Ben (00:48:33):
Yeah. So I think maybe one of the key distinctions is labor basically involved in the improvement process. So one general counter to this AI feedback loop being really important to really increasing the rate of change that much. I guess we do already do have these feedback loops where loads of tasks that researchers or engineers would have been doing at the beginning of the 20th century, they just don’t do anymore. They’ve just been completely automated. So just actually doing calculations by hand is like a huge time sink. It’s like research effort for engineering. So there’s been massive, massive automation, in terms of the time that people spent doing, a huge portion of it’s been automated either way. So in that sense, there’s been this really strong feedback loop where technological progress has helped technological progress.

Ben (00:49:25):
But at least since the middle of the 20th century, we haven’t seen an increase in the rate of productivity growth, like technological progress, at least in leading countries. It seems to have actually gone slower, if anything. And the rate now is comparable to the beginning of the 20th century in the U.S.. So clearly this feedback loop isn’t enough on its own and there’s an offsetting thing and probably to mean the same thing as like this idea is getting hard to find phenomenon. Where technology helps you make new stuff, but also each new thing you want to make is a bit harder to make from the previous thing. Because if it was easy, you would’ve already done it. So that’s one general counter argument.

Ben (00:50:01):
And then the counter, counter argument to that is like, well this whole time that we’ve been automating lots of the tasks involved in research and then creating machines to do them and then improving the machines. Human labor has always been a part of it. And if you have this story where human labor stuffed on by capital basically is complementary. I think we have labor bottlenecks story where we keep making cooler machines and we keep making more machines. But there’s diminishing returns on the coolness of your machines or the quantity of your machines for fixed amount of research effort. So research effort’s really the bottleneck. It creates this diminishing returns phenomenon where it really limits the marginal value of the additional cool tech stuff that’s involved, done by researchers or owned by researchers. And then the number of researchers grows at this pretty constant exponential rate that can’t really be changed that easily because it’s linked to the population and things like that.

Ben (00:50:57):
So then I was talking, if you actually remove just human labor completely from the picture, just people are just not involved in R&D anymore or manufacturing. Then maybe in that case you no longer have this diminishing returns effect, you no longer have this bottleneck that you get diminishing returns on capital for like a fixed amount of labor. Maybe it just feeds back directly to itself, diminishing returns go away in some important sense. And then the feedback loop really takes off once you just completely remove humans from the loop, would be the story you could tell to say why the feedback loop will be different in the future than the non-explicit feedback loop we’ve had for the past century.

Jeremie (00:51:29):
And I guess there is a feedback of human self-improvement. I think clock time is the distinguishing characteristic here, but I do strive to improve myself in my productivity and I do strive to nail that myself. I try to improve the way I improve myself. In principle, I think I do that to an infinite number of derivatives or as close to that as matter. So there is an exponential quality to it, but clearly I’m not Elon Musk yet. I haven’t achieved hard take-off so there’s a difference there somewhere.

Ben (00:52:05):
Yeah. So I guess the thing I’d say there is probably that, I think you’re definitely right, that that’s a real phenomenon. I think though that the kind of orders of magnitude involved, how much you would have to self-improve is just smaller than it is for technology. So let’s imagine a researcher unit is a person in your laptop. And that’s the thing that produces the research. The person can actually make themselves better at coding and they can make themselves better at learning how to do things quickly, they can learn how to learn. But maybe the actual difference in productivity, maybe you can help to increase by like a factor of 10 in terms of human capital relative to what the average researcher in 2020 is. Whereas your laptop, it seems to get maybe has more [inaudible 00:52:44] to climb up in terms of how much better it can get than it is right now.

Jeremie (00:52:49):
That does unfortunately seem to be the case, but I just need to keep working at it. I think that’s what it needs.

Ben (00:52:56):
Yeah. I wish you best of luck in your race against your laptop’s rate of improvement.

Jeremie (00:52:59):
Yeah. Thanks. I’ll let you know if I hit take off. So that’s really interesting that you have done so much thinking on this and I can see in myself some shifts in terms of the way that you’re thinking about this, certainly there are aspects of it that I hadn’t considered before. That do come from this economics perspective that come from the systems perspective. Is this a way of thinking that you think is especially uncommon among technical AI safety people? Or are you starting to see that become adopted where… I’m still trying to piece together what the landscape looks like and how views have been shifting on this topic over time. Because just by way of example, I remember 2009, it was [inaudible 00:53:45]. Basically everybody was talking about this idea of a brain in a box or some fast takeoff thing where a machine self improves and so on.

Jeremie (00:53:54):
Whereas now it really does seem like between OpenAI, Paul Christiano, and a lot of the work being done at Future of Humanity Institute, things are shifting. And I’d love to get your perspective on that shift, that timeline and where the community now stands with respect to all these theses.

Ben (00:54:11):
Yeah. So I do definitely think there’s been a shift in the way, let’s say the median person in these communities is thinking about it. It’s a little bit ambiguous to me how much of it is a shift in terms of people who used to think one way shifting to another way of thinking. Versus more people entering the community with a preexisting different way of thinking. I do think that there is some element of people thinking about things in a bit more of a concrete way. What you think a lot of the older analysis, it’s very abstract. It’s very much relying on… It’s not exactly like mathematical, it’s like people doing an abstract algebra or something. But it’s definitely maybe like a more mathematical mindset.

Ben (00:54:58):
And it’s shifted over time. And I think one reason for that, which is very justifiable, it’s just when people are talking about this in the mid 2000s. Machine learning, wasn’t really a huge thing. People thought it would be more maybe logic oriented systems would be what maybe AGI would look like. Anything that really looked at all AGI-ish to really use as a model to think about. And I think as machine learning from took off and people started to have these systems, something like GPT-3 where obviously this is not AGI and probably AGI will be very different than that. It’s like a little bit of a stepping stone in the path to AGI. It’s like a little bit maybe AGI-ish or something.

Ben (00:55:41):
I think having these concrete examples just leads you to start thinking in a slightly different way. Where start to realize that they’re actually a little bit hard to describe in the context of maybe these abstract frameworks that you had before. So GPT-3 does it have a goal or if you want to predict this behavior, how useful. I guess it’s goal is to produce whatever next word would be unsurprising, but it’s somewhat doesn’t exactly feel right to think that way. It’s not clear how useful it is for predicting this behavior. It doesn’t really seem like there’s a risk of it doing something crazy like killing people to prevent them from stopping it from outputting. Somehow it just feels like it doesn’t really fit very well. And also just seeing more concrete applications and thinking… So I think just saying like Paul Christiano said, for example, to some extent being optimistic about, “Oh, I think you could actually probably do that thing with machine learning, not that far in the future without major breakthroughs.” Lends people to also think in a more continuous sense where it’s not all or nothing. It’s like you can see the stepping stones of intermediate transformations.

Ben (00:56:41):
So I think it’s seeing intermediate applications, having a bit more concreteness. And feeling a little bit like more skeptical of the abstract constituting, just because it’s hard to fit them onto the thing you’re seeing, or maybe some forces that have had an effect. Typically, I do definitely think that there are plenty of people who think that the more mathematical and classical way of approaching things is still quite useful or that may be the predominant way they approach things.

Jeremie (00:57:09):
Yeah. I actually have heard arguments… Not necessarily arguments that a system like GPT-3 would become pathological in the way you’ve described. But at least stories that can be told that sound internally consistent that describe worlds in which a system like that could go really badly wrong. In that case, it’s something like, imagine GPT-10, whatever the year would have to be for that to happen. And you had the system that necessarily, it is doing this like glorified auto-complete task. But in order to perform that task, one thing that seems clear is that it’s developing a fairly sophisticated model of the world. There’s some debate over the extent to which this is memorization versus actual generalizable learning. But let’s give GPT-3 the benefit of the doubt and assume it is generalizable learning. To the extent that that’s the case, the system continues to develop a more and more sophisticated model of the world, a larger and larger context window.

Jeremie (00:58:06):
Eventually that model of the world includes the fact that GPT-3 itself exists and is part of the world. Eventually this realization, as it tries to optimize its gradients makes it realize, “Oh I could develop direct control over my gradients through some kind of wire-heading,” is usually how it’s framed in the [crosstalk 00:58:25] community and so on. I think the problems that you described apply to this way of thinking. But it’s interesting how GPT-3 really has led to this concrete thinking about some of those abstracts.

Ben (00:58:39):
Yeah. I think it’s also very useful to have these concrete systems because I also think they force differences in intuition. Or force differences in comeback and assumptions to the surface. So just as one example, there’s definitely the cases that some people have expressed concern about these GPT systems or if you have GPT-10 then maybe this would be very dangerous. And I actually wouldn’t have guessed this. Or I guess I wouldn’t have guessed that other people had this intuition just because I didn’t have it. Because my baseline intuition is just basically too rough approximation the way the system works. It’s a model of some parameters and then it’s exposed to like a corpus of text. And it just basically outputs an X word and then the next word is actually right or it’s not. Or basically there’s a gradient that pushes the outputs to be less and less surprising relative to whatever the actual words in a data set are.

Ben (00:59:35):
It’s just basically being optimized for outputting words, which would be unsurprising to find as an X word in a piece of text, which is online somewhere. And when I think of GPT-10, I think, “Wow, I guess it just outputs words, which would be very unsurprising to find on webpages online.” It’s just like the thing that it does. And suppose, let’s say it does stuff like outputs words which lead people to destroy the world or something. It seems like it would only do that if those would be words that would be the most unsurprising to find online. If the words that lead it to destroy the world are not, would it be surprising to find online because people don’t normally write that sort of thing online. Then it seems like something weird has happened with the gradient descent process.

Jeremie (01:00:15):
So I think that’s a really great way to frame it. I believe the counter-argument to that might sound something like, we might look at human beings 200,000 years ago as sex optimizers or something like that. And then we find that we’re not that as our evolution has unfolded. I think the case here is that well, first off there’s a deep question as to what it is that a neural network actually is optimizing. It’s not actually clear that it’s optimizing its loss function or it feels a kick every time its gradients get updated. It goes like, “Oh, you’re wrong. Update all your rates by this.”

Jeremie (01:00:58):
Does that kick hurt? And if it does then, is that the true thing that’s being optimized by these systems? And then if that’s the case, then there’s this whole area obviously inner alignment that we’re skirting around here, but it’s a deep rabbit hole, I guess.

Ben (01:01:15):
So I sort of agree that there’s a distinction between the loss function that’s used when training the system and what this system acts like it’s trying to do. And there’s one really simple way of saying that is if you start with like a chess playing reinforcement learning system. And you have a reward function, that loss function associated with it, and you just haven’t trained it yet. It’s just not going to act like it’s trying to win at chess because that’s like one of the bluntest examples of like, it just doesn’t add up.

Ben (01:01:40):
And then obviously, you have these transplanting cases where you train a system in let’s say a video game where it gets points every time it opens a green box that’s on the left and on the right there’s like a red box. And you put it in a new environment where there’s a red box in the left and a green box on the right. And the training data you’ve given it so far, isn’t actually sufficient to distinguish, sounds like what is actually the thing that’s being rewarded. Is it for opening the red boxes or is it for opening the box on the left? And you shouldn’t be surprised if the system, for example, opens the box on the left, even though actually the thing that isn’t a loss function is the red box or vice versa. It wouldn’t be surprising if it’s generalized in the wrong way.

Ben (01:02:21):
So I certainly agree that there can generalization errors. I struggle to see why you would end up with, like in the case of something like GPT-3, I just don’t understand mechanistically what would be happening, where it would be… So let’s say that the concern is because it’s the text generation system that puts out some text where if it’s read by someone, it’s an engineering blueprint for something that kills everyone, let’s say. Which I don’t know if there’s like a non-sci-fi version of this where it leads to existential risk, but let’s say it’s the thing it does. I sometimes feel like I’m almost being… To answer or something or I’m missing something. But I just don’t understand mechanistically why would this grading design process lead it to have a policy that does that. Why would it in any way be optimized in that direction?

Jeremie (01:03:06):
The answer I would give, I’m sure not having put sufficient thought into this, I should preface. But is in principle, if we imagine, let’s say unlimited amount of compute, unlimited scale of data, and so on. This model would, let’s say it starts to think, and it thinks more and more and more and develops like a larger and larger and more complete picture of the world. Again, depending on what it’s trying to optimize, assuming it’s trying to optimize for minimizing its gradients. Here this is very course, I assume I’m wrong somehow, but somehow it feels like right to imagine that a neural network feels bad every time it gets kicked around. I don’t know.

Ben (01:03:47):
I don’t think it actually makes any sense, as much as it feels bad. I think it’s just, it has certain parameters and then it outputs something and it compares to the training set. And then based on the discrepancy, it’s [inaudible 01:04:02] kicked in a different direction. But I don’t think that there’s actually any internal… I don’t think there’s actually a meaningful sense which it feels bad. It has parameters that get nudged around by like a stick. It’s this guy with a stick, pushing the parameters in different directions on the basis of the discrepancy or lack of discrepancy, and then they eventually end up somewhere.

Jeremie (01:04:23):
Yeah. So this in and of itself is like, I think one of the coolest aspects. I’m about to get distracted by the inner alignment excitement here. But it’s one of the coolest aspects to me of the alignment debate, because it really gets you to the point of wondering about subjective experience and consciousness. Because there’s no way to have the conversation without saying, like, “This is some kind of learning process.” And learning process tends to produce an artifact like in humans, it’s a brain that seems to have some subjective experience, basically all life. You can look at an amoeba, move it around under a microscope. It really seems like it experiences pain and joy in different moments in different ways.

Jeremie (01:05:02):
So anyways, seeing these systems that behave in ways that could be interpreted similarly inspires at least in me questions about what is the link between the actual Mesa-objective, the function that the optimizer is really trying to improve and subjective experience. I’m going into territory I don’t understand nearly well enough. But maybe I can leave the thought at, I think this is a really exciting and interesting aspect of the problem as well. Do you think that consciousness and subjective experience have a role to play, the study of that in the context of these machines? Or are you-

Ben (01:05:44):
I think not so much of that. There’s a difficulty here where there’s obviously the different notions of consciousness people use. So I guess I predominantly think of it in I guess the David [inaudible 01:05:55] sense of conscious experience as this at least hypothesized phenomenological thing that’s not intrinsically a part of the… It’s not like a physical process, so it’s not a description of how something processes information. It’s an experience that’s layered on top of the mechanical stuff that happens in the brain. Whereas if you’re illusionist, you think that there is no such thing as this, and this is like a woo-woo thing. But I guess for that notion of consciousness, it doesn’t seem in a sense very directly relevant because it doesn’t actually have the weird aspects of it. It’s by definition or a hypothesis, not something that actually physically influences anything that’s happened somewhat behaviorally. And you could have zombies where they behave just the same way, but they don’t have this additional layer of consciousness on the top.

Ben (01:06:44):
So that version of consciousness, I don’t see as being very relevant to understanding how machine learning training works or how issues on MACE optimization work. And maybe there’s mechanistic things that people sometimes refer to using consciousness, which I think sometimes has to do with the information system. Somehow having representations of themselves is maybe one traits that people pick out sometimes when they use the term consciousness. It seems like maybe some of that stuff is relevant or maybe beliefs about what your own goals are, this sort of thing. Maybe this has some interesting relationship to optimization and human self-consciousness and things like that. So I could see a link there, but I guess this is all to say it depends a bit on the notion of consciousness that one has in mind.

Jeremie (01:07:38):
No, makes perfect sense. And it’s interesting how much these things do overlap with so many different areas from economics to theories of consciousness, theories of mind. Thanks so much for sharing your insights, Ben, I really appreciate it. Do you have a Twitter or a personal website that you’d like to share so people can check out your work because I think you’re working on fascinating stuff.

Ben (01:07:57):
Yeah. So I do have a personal website with very little on it, but there’s like a few papers I reference. That’s benmgarfinkel.com. And I have a Twitter account, but I’ve never tweeted from. I forget what my username is, but if you would like to find that and follow me, I may one day tweet from it.

Jeremie (01:08:15):
That is a compelling pitch. So everyone, look into the possibility of Ben tweeting some time.

Ben (01:08:22):
You could be among the first people to ever see a tweet from me if you get on the ground floor right now.

Jeremie (01:08:27):
They’re getting it at seed. This is time to invest seed stage. Awesome. Thanks so much, Ben. I will link to both those things including the Twitter.

Ben (01:08:36):
I look forward to the added Twitter followers.

Jeremie (01:08:40):
There you go. Yeah. Everybody, go and follow Ben, check out his website. And I’ll be posting some links as well in the blog post that will accompany this podcast just to… Some of the specific papers and pieces of work that Ben’s put together that we’ve referenced in this conversation because I think there’s a lot more to dig into there. So Ben, thanks a lot. Really appreciate it.

Ben (01:08:56):
Thanks so much. This was a super fun conversation.

How Do Data Scientists Use Twitter? Let Us Count the Ways

2021년 2월 18일 by hyungseok

Reading List

How Do Data Scientists Use Twitter? Let Us Count the Ways

Discover some of the best Twitter data analyses from the TDS archive.

Ben Huberman

1 day ago·4 min read

Anecdotally, it would seem that you can be on Twitter, or you can attempt to find joy, health, and balance in your life, but not both. The writers, journalists, and niche food-opinion-havers in my timeline form a fairly diverse lineup. What unites us is a strong ambivalence—yes, I’m mincing words here—about the space itself: the “hellsite” we can’t leave behind.

Digging into the TDS archive these past few weeks has had the unexpected effect of suggesting a different possibility—an alternate reality, even. Here were dozens of data scientists and AI experts spending massive amounts of time on Twitter and… doing productive things with it?! Not falling into ever-wilder spirals of despair?! Drawing thoughtful insights about the world?!?!

How was any of that possible?

It’s clear to me now that my use of the platform as a primary source of news has dramatically shaped my perception of it. If you mostly go on to Twitter to stay abreast of [waves hands randomly and indiscriminately at the world], it would stand to reason that the emotions you have about the news eventually converge with the ones you feel about the app. On the flip side, taking a step back to analyze how other people and communities use the platform — which Twitter has made possible thanks to its robust API—lends itself to blissful detachment (or at least a semblance of it), one that I one day hope to achieve as well.

To prove my point, here are some of my favorite TDS Twitter/data science crossover posts — read them! All of them, some of them, a couple; you won’t regret it. The archive runs extremely deep, so I assembled this collection after some major filtering, and across three categories: new and noteworthy, hands-on resources, and all-time greats. Let’s dive in.

New and noteworthy

Did I mention I use Twitter mostly for news consumption? The reason is the platform’s undeniable ability to capture political and cultural moments — sometimes it is the moment. From Black Lives Matter (and the occasionally fraught ways companies have attached themselves to the social movement), to Twitter users’ jubilant reception of Netflix hit Bridgerton over the holidays, these recent posts apply the tools of data science to study such moments in depth, and do it extremely well.

The Fortune 100 and Black Lives Matter

A dataset of Fortune 100 tweets during BLM protests reveals corporate America’s awkward relationship with social…

towardsdatascience.com

BRIDGERTON: An analysis of Netflix’s most-streamed TV series

An analysis of over 300,000 tweets on the Bridgerton TV series using NLP techniques in Python & Tableau

towardsdatascience.com

Trump’s Twitter Network

Social Media: The Challenge in Taking Action

towardsdatascience.com

Large-Scale Analysis of On-line Conversation about Vaccines before COVID-19

Discussion over the role and need of vaccines has never been so strong. How was it before COVID-19? Social media can…

towardsdatascience.com

Twitter Sentiment Analysis Based on News Topics during COVID-19

— How the online public respond to the pandemic

towardsdatascience.com

Hands-on resources

Many of our readers come to TDS because our community is where they find answers to the practical challenges they face in their work, in their studies, or in their passion projects. Twitter data analysis is no exception, and the posts gathered here provide clarity and step-by-step instructions on how to collect, clean, process, and draw insights from tweets.

Twitter data collection tutorial using Python

“Without data, you’re just another person with an opinion” — W. Edwards Deming

towardsdatascience.com

12 Twitter Sentiment Analysis Algorithms Compared

12 sentiment analysis algorithms were compared on the accuracy of tweet classification. The fasText deep learning…

towardsdatascience.com

Twitter Data Mining — Measuring Users’ Influence

Followers and Influence; A Scientific Insight

towardsdatascience.com

My First Twitter App

How to use Python and Tweepy to create your own dataset

towardsdatascience.com

Visualization of Information from Raw Twitter Data — Part 1

Lets explore what kind of information we can easily retrieve from raw Twitter data!

towardsdatascience.com

Generating Twitter Ego-Networks & Detecting Ego-Communities

A Graph-based approach to community detection in Twitter Networks

towardsdatascience.com

All-time greats

Yes, our archive on this topic is immense, but some posts — unlike most tweets!—have really stood the test of time, and are just as sharp and engaging today as they were when their authors first published them. They cover a wide range of topics, from Twitter’s own hiring process to detecting signs of depression in tweets’ linguistic markers. But first: CUTE DOGS.

Twitter Analytics: “WeRateDogs”

A Data Wrangling and Analysis Blog

towardsdatascience.com

What Twitter learned from the Recsys 2020 Challenge

We describe the insights from RecSys 2020 Challenge, for which Twitter provided sponsorship and a large dataset of user…

towardsdatascience.com

You Are What You Tweet

Detecting Depression in Social Media via Twitter Usage

towardsdatascience.com

Facebook and Twitter were born in 18th century Europe

How a simple real world puzzle gave rise to the mathematics that powers the biggest social media engines.

towardsdatascience.com

Another Twitter sentiment analysis with Python — Part 1

It has been a while since my last post. During my absence in Medium, a lot happened in my life. I finally gathered my…

towardsdatascience.com

4 Tips You Need To Know on NLP — from a Twitter Data Scientist

Exclusive look into Twitter’s NLP projects, interview process, and data science tools — TDS Interviews.

towardsdatascience.com

Let us know if any of the above has resonated with you in particular. Also: have you read a great post—here or on another site—featuring a Twitter-related data project? Have you written one yourself? Share it with us in the comments. Also-also: are there other topics you’d like to see us cover in a future reading list? Tell us that, too.

Georgia Tech’s MS Analytics Program: My Review Part II

2021년 2월 17일 by hyungseok

Georgia Tech’s MS Analytics Program: My Review Part II

Steven Finkelstein

After writing my first review of the OMSA program, I have been contacted by many prospective and current students. Because this program attracts students from a wide variety of backgrounds, providing advice that applies to everyone can be challenging. Below are some of the most common questions I receive and my attempt to answer them.

Is OMSA the right program for me?

If you read the program description on their degree overview page, they use the descriptor, “interdisciplinary”. I think this was the perfect word choice for the program. The range of topics covered in this program come from a wide variety of disciplines. The required courses cover finance, accounting, object-oriented programming, data analysis, machine learning, statistics, web development, cloud computing, data cleaning, scripting languages, and data visualization. The emphasis of this highly technical program is breadth over depth. If that sounds appealing, then this might be the program for you. However, if you wish to become an expert and specialize in any one of those individual disciplines, there are better options than this program.

Although this program emphasizes breadth, do not underestimate the difficulty level of the material. Being competent across business, computer science, and statistics is no easy feat. Often individuals will struggle with at least one of the disciplines covered in this program. If you are concerned with the technical rigor of the degree, I would advise you to improve your skills prior to enrollment or consider other programs. Other online quantitative master’s degrees include Data Science, Applied Economics, Statistics, Computer Science, Business Analytics, and an MBA: Analytics Concentration. Analytics programs run out of a university’s business school will likely be less rigorous in mathematics and computer science.

Which OMSA specialization should I choose?

After you finish the required courses, there are 3 specializations you can choose from. The word, specialization, is a bit misleading because the majority of the program is the same for every student. Only 6/36 credits are tailored to that specialization; otherwise, the rest is required or up to your discretion. You can quickly summarize the three specializations with the following:

Although the difficulty level is somewhat subjective, I think most students would agree with my assessment. The computer science courses tend to require the most hours of work per week.

Is there a particular order of classes that you would recommend?

With so many different disciplines covered in this program, it can be tricky to decide the appropriate order of classes to take. While deciding, there are many variables to take into account. Here are just a few of those questions to answer before you create your ideal 2 to 4-year graduate plan.

How many classes can you take per semester?
The average workload ranges from 8–20 hours per week per 3 credits, depending on the course. You need to consider which classes can be paired together based on the class expectations. Your individual time commitment will differ from the average depending on the strength of your skills in that subject. For example, if you have a strong background in statistics, then you can likely assume a below-average time commitment for statistics courses.

Which classes are offered each semester?
Summer semesters are shorter than Fall and Spring. Therefore, they only offer courses that can move at a quicker pace, which is a subset of the full class list. Because they run at a faster pace, the average workload per week is about 20 percent higher than average.

Which classes are optimal to take in close proximity?
It is much easier to take two related disciplines in close proximity to each other. Personally, I was at the peak of my programming skills after finishing Introduction for Computing for Data Analytics. I wish I took the hardest required computer science course, Data and Visual Analytics, immediately after this one. You should consider which classes pair well together as complementary subjects.

Is the OMSA program worth it?

To determine worth, you need to estimate the cost of the program and the expected return of completing it. The monetary cost for this program is 13,000 dollars*, but the labor cost is 2,160 hours** to complete the program. In terms of comparable graduate degrees, the tuition of this program is towards the lowest end of the spectrum. However, if you think the value of credentialism is waning in society, there are much cheaper options to learn this material (e.g. Coursera, Free online resources, etc.). If the cost of tuition doesn’t concern you the time commitment should. As stated previously, this program is about breadth in the analytics space rather than depth. If you wish to become an expert at a subset of analytics, such as experimentation, artificial intelligence, or data engineering, then your time is better spent elsewhere.

From a return perspective, the program primarily offers two things: a structured learning environment and, at a minimum, an entry-level data analytics position. The program offers a well-thought-out syllabus for most courses from experts in each of their respective domains. The assignments are organized and typically have auto-graders whenever possible for timely feedback. The environment includes multiple modes of communication to find assistance on assignments, including Piazza, Slack, and office hours sessions. Reading through dozens of posts on Piazza to find a hint at solving a particular problem can be a bit of a mess. This is about the extent to which they enable your learning. You will notice I did not mention lectures anywhere above. They can be hit or miss, depending on the course. Many lectures are so high level that you would be wiser to skip them entirely. Some lectures or assignments will include helpful links; however, the majority of your learning takes place outside of the GT environment through your own research (AKA Googling). As far as a return perspective, the knowledge you gain from this program is highly dependent upon the effort you put into learning the material versus simply getting a degree.

Moving onto the primary reason most people enroll in a graduate program, job opportunities. I don’t need to reiterate that data analytics skills are in hot demand. With increasing reliance on technology in every industry, I don’t see any future where computer science and statistics skills will not be marketable. Artificial intelligence will only increase our reliance on these skills because each automated process requires the installation and ongoing maintenance. Because there are so many types of individuals enrolled in this program, it is hard to say how much this degree will help you land the job you desire. In my opinion, a student graduating with an OMSA degree and zero years of work experience is qualified for your typical analytics job posting looking for 3–5 years of experience. Examples include data analysts, senior data analysts, business intelligence engineers, junior data scientists, and data science associates. However, if you want one of these positions at a top company or a more senior analytics role, you need to augment the degree with additional work experience or personal projects.

It’s up to you to decide whether adding the equivalent of an extra 3–5 years of data analytics experience is worth the cost of this program. I think the program is most valuable for those who fit at least one of the following criteria:

An experienced professional with some quantitative/programming background looking to switch careers into data analytics
An experienced analytics professional who wants to accelerate their professional advancement by 3–5 years
A student who needs structure to facilitate their learning
A student who is not afraid of complex problems that require persistence
An experienced professional who does not want to specialize in a specific data analytics or data science subject (e.g. Artificial Intelligence, Machine Learning Engineering, Data Engineering, Data Visualization, etc.)

*(275 per credit hr * 36 credit hours) + ((194 + 107 fees per semester) * 9 semesters)
**(12 hours per week per 3 credits * 15 weeks per semester) * (36/3 three credit blocks)

How can I prepare for this program?

If you decide to apply and enroll in this program, you need to confirm you are prepared for the material. The admission requirements are suboptimal in preparing students for this program. In order to be successful in this program, I would recommend having a certain level of experience across data analytics, mathematics, and computer science.

Data analytics topics and tools you should be familiar with:

SQL
Excel
Data visualization
Data cleaning
Data analysis

In terms of math, you should be comfortable with the following:

Calculus- integrals, derivatives, functions, limits
Statistics — hypothesis testing, p-values, confidence intervals, sampling
Probability- distributions, error
Linear Algebra- matrices, matrix operations, vectors, systems of equations

As far as computer science, you should have completed the equivalent of two, rigorous undergraduate computer science courses prior to this program. Rigorous means they required 10+ hours per week per 3 credits. You need to be an intermediate level programmer in at least one object-oriented programming language. This includes experience with debugging and diagnosing issues within code. Familiarity with web development will help, too. Most of all, you need an insatiable appetite for tackling difficult problems that will rarely be correct on the first attempt.

If you have any additional questions or comments, feel free to reach out. I am ALWAYS thinking about data analytics or finance.

~ The Data Generalist

Loading Multiple Well Log LAS Files Using Python

2021년 2월 16일 by hyungseok

Loading Multiple Well Log LAS Files Using Python

Appending Multiple LAS Files to a Pandas Dataframe

Andy McDonald

1 day ago·6 min read

Log ASCII Standard (LAS) files are a common Oil & Gas industry format for storing and transferring well log data. The data contained within is used to analyze and understand the subsurface, as well as identify potential hydrocarbon reserves. In my previous article: Loading and Displaying Well Log Data, I covered how to load a single LAS file using the LASIO library.

In this article, I expand upon that by showing how to load multiple las files from a subfolder into a single pandas dataframe. Doing this allows us to work with data from multiple wells and visualize the data quickly using matplotlib. It also allows us to prepare the data in a single format that is suitable for running through Machine Learning algorithms.

This article forms part of my Python & Petrophysics series. Details of the full series can be found here. You can also find my Jupyter Notebooks and datasets on my GitHub repository at the following link.

andymcdgeo/Petrophysics-Python-Series

This series of Jupyter Notebooks take you through various aspects of working with Python and Petrophysical data. A…

github.com

To follow along with this article, the Jupyter Notebook can be found at the link above and the data file for this article can be found in the Data subfolder of the Python & Petrophysics repository.

The data used for this article originates from the publicly accessible Netherlands NLOG Dutch Oil and Gas Portal.

Setting up the Libraries

The first step is to bring in the libraries we will be working with. We will be using five libraries: pandas, matplotlib, seaborn, os, and lasio.

Pandas, os and lasio will be used to load and store our data, whereas matplotlib and seaborn will allow us to visualize the contents of the wells.

Next we are going setup an empty list which will hold all of our las file names.

Secondly, in this example we have our files stored within a sub folder called Data/15-LASFiles/. This will change depending on where your files are stored.

We can now use the os.listdir method and pass in the file path. When we run this code, we will be able to see a list of all the files in the data folder.

From this code, we get a list of the contents of the folder.

['L05B03_comp.las',
 'L0507_comp.las',
 'L0506_comp.las',
 'L0509_comp.las',
 'WLC_PETRO_COMPUTED_1_INF_1.ASC']

Reading the LAS Files

As you can see above, we have returned 4 LAS files and 1 ASC file. As we are only interested in the LAS files we need to loop through each file and check if the extension is .las. Also, to catch any cases where the extension is capitalized (.LAS instead of .las), we need to call upon .lower() to convert the file extension string to lowercase characters.

Once we have identified if the file ends with .las, we can then add the path (‘Data/15-LASFiles/’) to the file name. This is required for lasio to pick up the files correctly. If we only passed the file name, the reader would be looking in the same directory as the script or notebook, and would fail as a result.

When we call the las_file_list we can see the full path for each of the 4 LAS files.

['Data/15-LASFiles/L05B03_comp.las',
 'Data/15-LASFiles/L0507_comp.las',
 'Data/15-LASFiles/L0506_comp.las',
 'Data/15-LASFiles/L0509_comp.las']

Appending Individual LAS Files to a Pandas Dataframe

There are a number of different ways to concatenate and / or append data to dataframes. In this article we will use a simple method of create a list of dataframes, which we will concatenate together.

First, we will create an empty list using df_list=[]. Then secondly, we will loop through the las_file_list, read the file and convert it to a dataframe.

It is useful for us to to know where the data originated. If we didn’t retain this information, we would end up with a dataframe full of data with no information about it’s origins. To do this, we can create a new column and assign the well name value: lasdf['WELL']=las.well.WELL.value. This will make it easy to work with the data later on.

Additionally, as lasio sets the dataframe index to the depth value from the file, we can create an additional column called DEPTH.

We will now create a working dataframe containing all of the data from the LAS files by concatenating the list objects.

When we call upon the working dataframe, we can see that we have our data from multiple wells in the same dataframe.

We can also confirm that we have all the wells loaded by checking for the unique values within the well column:

Which returns an array of the unique well names:


array(['L05-B-03', 'L05-07', 'L05-06', 'L05-B-01'], dtype=object)

If our LAS files contain different curve mnemonics, which is often the case, new columns will be created for each new mnemonic that isn’t already in the dataframe.

Creating Quick Data Visualizations

Now that we have our data loaded into a pandas dataframe object, we can create some simple and quick multi-plots to gain insight into our data. We will do this using crossplot/scatterplots, a boxplot and a Kernel Density Estimate (KDE) plot.

To start this, we first need to group our dataframe by the well name using the following:

Crossplot / Scatterplots Per Well

Crossplots (also known as scatterplots) are used to plot one variable against another. For this example we will use a neutron porosity vs bulk density crossplot, which is a very common plot used in petrophysics.

Using a similar piece of code that was previously mentioned on my Exploratory Data Analysis With Well Log Data article, we can loop through each of the groups in the dataframe and generate a crossplot (scatter plot) of neutron porosity (NPHI) vs bulk density (RHOB).

This generates the following image with 4 subplots:

Boxplot of Gamma Ray Per Well

Next up, we will display a boxplot of the gamma ray cuvre from all wells. The box plot will show us the extent of the data (minimum to maximum), the quartiles, and the median value of the data.

This can be achieved using a single line of code in the seaborn library. In the arguments we can pass in the workingdf dataframe for data, and the WELL column for the hue. The latter of which will split the data up into individual boxes, each with it’s own unique color.

Histogram (Kernel Density Estimate)

Finally, we can view the distribution of the values of a curve in the dataframe by using a Kernel Density Estimate plot, which is similar to a histogram.

Again, this example shows another way to apply the groupby function. We can tidy up the plot by calling up matplotlib functions to set the x and y limits.

Summary

In this article we have covered how to load multiple LAS files by searching a directory for all files with a .las extension and concatenate them into a single pandas dataframe. Once we have this data in a dataframe, we can easily call upon matplotlib and seaborn to make quick and easy to understand visualizations of the data.

Thanks for reading!

If you have found this article useful, please feel free to check out my other articles looking at various aspects of Python and well log data. You can also find my code used in this article and others at GitHub.

If you want to get in touch you can find me on LinkedIn or at my website.

Interested in learning more about python and well log data or petrophysics? Follow me on Medium.

Using the right tools to visualize data

2021년 2월 14일 by hyungseok

DATA VISUALIZATION

Using the right tools to visualize data

Tableau, ggplot2 & seaborn

Mubarak Ganiyu

When it comes to visualizing data, most people have a straightforward idea about what to do. They use scatterplots to display the relationships between two variables. Boxplots are used to compare the dispersion of distinct elements in a variable. Pie charts can be used to portray how different classes contribute as a whole to the variable. Time series plot can be used to display the progress made over time by someone or an organization.

Apart from having a solid idea of what chart to use, it is important to utilize a software package to create graphs and develop charts and there are multiple resources out there that can be used to make this possible. ggplot2 via R, seaborn via python, Tableau, PowerBI, MS Excel are among some of the famous platforms used to build charts.

This article will be focusing on the process it takes to build charts on three packages/platforms: Tableau, seaborn and ggplot2. The dataset that was utilized is the widely-used iris dataset. The iris dataset has five variables. Four of them are continuous variables: petal length, petal width, sepal length and sepal width. The last one is a categorical variable called species. It has three classes: setosa, virginica and versicolor.

By building the same charts across all three platforms, one can compare the quality of the charts and decide which one to use when working on data visualization projects. The two charts that were generated are:

A scatterplot that compares the relationship between sepal width and sepal length.
A bar chart that compares average values of the four variables across the different species.

The iris dataset is ready-made on both R-studio and Jupyter Notebooks. Therefore, it was easily exported for use on Tableau.

Tableau

Tableau is a platform that makes data visualization as easy as possible. Its huge advantage over python and R lies in the fact that it does not require code to load the dataset or to create graphs. Due to its drag and drop feature, it allows users to tinker around with the variables to build charts that effectively present information to its users. It also has other features that can be used to beautify charts and make them appealing to an audience.

A great example of Tableau in action. The chart was built and designed under one-and-a-half minute.

Tableau’s easy-to-use ability can be witnessed in the video above. A book that can act as a guide for beginners on how to master the art of using Tableau is Ben Jones’ Communicating Data with Tableau: Designing, Developing, and Delivering Data Visualizations. Other charts that were built using Tableau can be viewed below.

ggplot2

ggplot2 is an amazing package that is provided by R-studio. Unlike Tableau, it requires its users to import a package to build charts. Although it requires some coding, the syntax for coding is quite straightforward. Building a simple chart with ggplot2 involves two easy steps.

The first step is to load the tidyverse package. The ggplot2 package is one of the many packages provided by the tidyverse package. By loading the tidyverse package, users would also have access to other package’s functionality while designing graphs. The code for loading tidyverse can be viewed below.

install.packages("tidyverse")
library(tidyverse)

The second step is to use the coding syntax to generate a graph. The coding syntax can be seen below. ggplot() invokes the ggplot2 package and identifies the data to be used. geom_point() signifies that a scatterplot with points is the desired graph. By using aes() within the geom_point(), it was easy to map out what variables should appear on the x and y axis as well as group them according to their species. The labs() can be used to add title for the graph and label both the x and y axis. Setting the theme to classic using theme_classic() makes it possible for the user to control the theme setting.

If a user is interested in plotting a chart different from the one created above via ggplot2, this link can act as a guide for the user.

ggplot(data = df_iris) +
  geom_point(aes(x = sepal_width, y = sepal_length, color = species)) +
  labs(title = "Sepal length vs Sepal width", x = "Sepal width", y = "Sepal length") +
  theme_classic()

Seaborn

Seaborn is a package that is provided by python. It acts as an improvement to matplotlib, another data visualization package provided by python, to beautify graphs. Seaborn functions just like ggplot2 in the sense that it requires its users to load a package and uses a coding syntax to obtain the desired plot. Below is the code for loading the seaborn package and other useful packages that will make it easy to design the graph.

import seaborn as sns; sns.set_theme(style = "dark")
%matplotlib inline
import matplotlib.pyplot as plt

After loading the packages, the next step is to use the right functionalities to plot a chart. plt.figure() can be used to decide the size of the plot. sns.barplot() takes in the variables to be placed on the x and y axis as well as the dataset to be used. Like ggplot2, further changes to the appearance of the plot are made inside the sns.barplot() function. plt.title(), plt.xlabel() and plt.ylabel() are used to label the plot.

If a user is interested in plotting a chart different from the one above via seaborn, this link can act as a guide for the user.

plt.figure(figsize = (20,12))
sns.barplot(x = "species", y = "number", data = n_iris2, hue = "feature", palette = "deep")
plt.title("Bar chart of the average values of the features across species", fontsize = 20)
plt.xlabel("Species", fontsize = 12)
plt.ylabel("Average value", fontsize = 12)

Conclusion

All three platforms discussed above are amazing for designing and building graphs. Tableau is a great way for someone who is not interested in coding to easily generate charts. ggplot2 and seaborn are coding platforms that provide users with an open-ended approach to control the appearance of their graphs. When it comes to data visualization, your imagination is your only limit.

Below is a list of recommended articles on data visualization:

Thank you for reading!

Elevate Your Data Science Abilities: Learn Resourcefulness

2021년 2월 12일 by hyungseok

Elevate Your Data Science Abilities: Learn Resourcefulness

Join me in this new series where we explore the soft-skills that will elevate your effectiveness as a data scientist

Nick Cox

What is Resourcefulness?

The coconut octopus is perhaps one of the most resourceful animals in the world. Octopi are commonly known for their intelligence, but the coconut octopus stands out from the rest due to one amazing trait: they often carry coconuts and seashells to be used as armored protection.

Because they are usually 3 to 6 inches in length and non-poisonous, protection is a must to avoid being lunch for another marine animal. But that is not all. These resourceful undersea creatures are capable of bipedal locomotion, better known as walking. Talk about amazing!

To be resourceful is to have the ability to find quick and clever ways to overcome difficulties. It is to be a problem solver with the resources you have at hand. And in data science we have no shortage of problems that we need to solve in completing our work.

I grew up in New Zealand and we have an expression that we can build or fix anything using number 8 fencing wire. We also refer to this as kiwi ingenuity, which is a can-do attitude and ability to think laterally to solve a problem.

In New Zealand we take pride in being a resourceful people, but I was not born with this skill. I had to learn it. It is a skill that requires deliberate practice and attention to develop it. But when you do, it is a skill that will bring countless benefits to you and your employer.

What Are the Benefits of Being Resourceful?

Resourceful people are open-minded: they are open to new ideas, opinions and challenges. Resourceful people are avid readers and explorers and are always learning.
Resourceful people are self-assured: they are equally confident in what they know and in what they do not know. Resourceful people dare to ask for what they need.
Resourceful people are creative: this is where the concepts of number 8 wire and ingenuity come in to play the most. Resourceful people are open to new and different solutions. They do not believe that a problem must be solved one way because it always has been.
Resourceful people are proactive: they will not sit on their hands and wait for a solution to arrive. Resourceful people will stand up and take the lead, gathering like-minded colleagues to join them for the journey.
Resourceful people are persistent: they know that there are many ways to solve a problem and that sometimes the problem will not be solved on the first attempt. But, they will keep on trying until it is done.
Resourceful people are problem solvers: when faced with a new problem, resourceful people have the skills to both apply past learnings and to seek out new knowledge to find the best solutions.
Resourceful people are adaptable: they are not bound to a single approach to all problems and will gladly seek out the advice of colleagues to find alternative ways. In fact, they feel accomplished when they do.

How Do I Become Resourceful?

At the heart of it, being resourceful is about physical action. Here are some tangible ways for you to develop and demonstrate your resourcefulness:

Hone your research skills: In my mind, being resourceful is about knowing how and where to find the information that you need. Create a directory of useful resources that you can research when the need arises. As data scientists there are many resources at our disposal, often at no cost: Google Search, Git Hub, Medium, Stack Overflow and many more. Oftentimes you do not need to reinvent the wheel, as the solution or at least tidbits that will help get you there are already available.
Leverage your network: It cannot be overstated how important it is to have a strong network to reach out to. Do not be afraid to reach out to former colleagues, business partners and industry experts to ask for help. In addition to obtaining information to help you solve your current problem, you may also learn new skills to add to your toolkit for working on future problems.
Continually develop new skills: There is no better way to be resourceful than to add to the resources available to you. In data science there are always new tools, techniques and methods to learn. Keep up to date.
Know your strengths and weaknesses: The better that you understand these, the quicker you can determine whether you already have the resources and knowledge needed to solve the problem or whether you need to dive into research or reach out to your network.
Give yourself time to think and strategize: A pause to determine your strategy for solving a problem, what resources are required and where you can obtain them will pay endless dividends.
Be better at leveraging what you already have: A key to resourcefulness is understanding what resources you already have at your fingertips and how best to use them. It is oftentimes quicker and easier to use what you already have than to seek the resources elsewhere. But, also key is to quickly identify when you do not have what you need.

In addition to being a skill, resourcefulness is a mind-set, an attitude. Being resourceful requires a conscious effort to take action.

Concluding Remarks

To achieve what you want, you should not have to rely on the resources that you have or have not got. Your ability to become a resourceful person depends on how you use what you already have and how you obtain what you do not have but need.

As Tony Robbins once said, it is not the lack of resources, it is your lack of resourcefulness that stops you.

I have no doubt that my resourcefulness has led to successes in my career and in life. In fact, I enjoy the challenge of solving problems and I enjoy searching for solutions.

If you too become a resourceful person, you will influence others around you with better thinking and more creativity.

Please also check out my series Learn Python Data Analytics by Example: