Semiconductor Manufacturing¶
Project Description¶
A complex modern semiconductor manufacturing process is normally under constant surveillance through signals and variables collected from sensors and/or process measurement points. However, not all of these signals are equally valuable in a specific monitoring system. The measured signals contain a combination of useful information, irrelevant information, and noise. Engineers typically have far more signals than are actually required. If we consider each type of signal as a feature, then feature selection may be applied to identify the most relevant signals. Process engineers may then use these signals to determine key factors contributing to yield excursions downstream in the process. This enables increased process throughput, decreased time to learning, and reduced per-unit production costs. The signals can also be used as features to predict the yield type, and by trying out different combinations of features, the signals that most affect the yield type can be identified.
Context¶
Manufacturing process feature selection and categorization
Content¶
Abstract: Data from a semi-conductor manufacturing process
- Data Set Characteristics: Multivariate
- Number of Instances: 1567
- Area: Computer
- Attribute Characteristics: Real
- Number of Attributes: 591
- Date Donated: 2008-11-19
- Associated Tasks: Classification, Causal-Discovery
- Missing Values? Yes
A complex modern semi-conductor manufacturing process is normally under consistent surveillance through signals and variables collected from sensors and/or process measurement points. However, not all of these signals are equally valuable in a specific monitoring system. The measured signals contain a combination of useful information, irrelevant information, and noise, and the useful information is often buried in the latter two. Engineers typically have far more signals than are actually required. If we consider each type of signal as a feature, then feature selection may be applied to identify the most relevant signals. Process engineers may then use these signals to determine key factors contributing to yield excursions downstream in the process. This enables increased process throughput, decreased time to learning, and reduced per-unit production costs.
To enhance current business improvement techniques, the application of feature selection as an intelligent systems technique is being investigated.
The dataset presented in this case represents a selection of such features, where each example represents a single production entity with its associated measured features. The labels represent a simple pass/fail yield for in-house line testing (figure 2) together with an associated date-time stamp, where -1 corresponds to a pass, 1 corresponds to a fail, and the date-time stamp is for that specific test point.
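As a rough illustration of the feature-selection idea described above, the sketch below ranks signals with a simple filter method: the absolute Pearson correlation between each signal and the -1/1 label. This is only one possible technique and is not presented as the method used by the dataset authors. The function name `rank_signals`, the `max_missing` threshold, and the toy data are all assumptions made up for this example.

```python
import numpy as np
import pandas as pd


def rank_signals(features: pd.DataFrame, labels: pd.Series, max_missing: float = 0.5) -> pd.Series:
    """Rank signals by |Pearson correlation| with a -1/1 label (a simple filter method)."""
    # Drop signals whose share of missing values exceeds the (assumed) threshold
    kept = features.loc[:, features.isna().mean() <= max_missing]
    # Drop constant signals, for which the correlation is undefined
    kept = kept.loc[:, kept.nunique(dropna=True) > 1]
    # Correlate every remaining signal with the label, then sort by strength
    scores = kept.corrwith(labels).abs()
    return scores.sort_values(ascending=False)


# Toy data, invented purely for illustration (not taken from the dataset)
rng = np.random.default_rng(0)
labels = pd.Series(rng.choice([-1, 1], size=200))
toy = pd.DataFrame({
    "informative": labels + rng.normal(scale=0.5, size=200),
    "noise": rng.normal(size=200),
    "constant": np.ones(200),
})
ranking = rank_signals(toy, labels)
print(ranking)
```

A univariate filter like this is cheap and easy to read, but it ignores interactions between signals, so it is best treated as a first screening step rather than a final selection.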
Acknowledgements¶
Dataset Authors: Michael McCann, Adrian Johnston
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import tensorflow as tf
import seaborn as sns
Importing the Dataset¶
dataset = pd.read_csv('SemiconductorManufacturingProcessDataset.csv')
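The `Time` column is read as plain text by default. If time-based analysis is needed, it could be converted to proper timestamps. The sketch below assumes the month/day/year layout seen in the sample rows (for example `7/19/2008 11:55`); the helper name `parse_time` and the two-row sample frame are hypothetical.

```python
import pandas as pd


def parse_time(df: pd.DataFrame, column: str = "Time") -> pd.DataFrame:
    """Return a copy of df with `column` converted from text to datetime64."""
    out = df.copy()
    # Assumed layout: month/day/year hour:minute, as in '7/19/2008 11:55'
    out[column] = pd.to_datetime(out[column], format="%m/%d/%Y %H:%M")
    return out


# Two timestamps copied from the table preview, used as a stand-in for the full file
sample = pd.DataFrame({"Time": ["7/19/2008 11:55", "10/17/2008 5:26"]})
parsed = parse_time(sample)
print(parsed.dtypes)
```

With the real data this would be called as `parse_time(dataset)`.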
Showing the Dataset in a Table¶
# `dataset` is already a DataFrame, so it can be displayed directly
dataset
Index | Time | Sensor 1 | Sensor 2 | Sensor 3 | Sensor 4 | Sensor 5 | Sensor 6 | Sensor 7 | Sensor 8 | Sensor 9 | ... | Sensor 429 | Sensor 430 | Sensor 431 | Sensor 432 | Sensor 433 | Sensor 434 | Sensor 435 | Sensor 436 | Sensor 437 | Pass/Fail |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7/19/2008 11:55 | 3030.93 | 2564.00 | 2187.7333 | 1411.1265 | 1.3602 | 97.6133 | 0.1242 | 1.5005 | 0.0162 | ... | 14.9509 | 0.5005 | 0.0118 | 0.0035 | 2.3630 | NaN | NaN | NaN | NaN | Pass |
1 | 7/19/2008 12:32 | 3095.78 | 2465.14 | 2230.4222 | 1463.6606 | 0.8294 | 102.3433 | 0.1247 | 1.4966 | -0.0005 | ... | 10.9003 | 0.5019 | 0.0223 | 0.0055 | 4.4447 | 0.0096 | 0.0201 | 0.0060 | 208.2045 | Pass |
2 | 7/19/2008 13:17 | 2932.61 | 2559.94 | 2186.4111 | 1698.0172 | 1.5102 | 95.4878 | 0.1241 | 1.4436 | 0.0041 | ... | 9.2721 | 0.4958 | 0.0157 | 0.0039 | 3.1745 | 0.0584 | 0.0484 | 0.0148 | 82.8602 | Fail |
3 | 7/19/2008 14:43 | 2988.72 | 2479.90 | 2199.0333 | 909.7926 | 1.3204 | 104.2367 | 0.1217 | 1.4882 | -0.0124 | ... | 8.5831 | 0.4990 | 0.0103 | 0.0025 | 2.0544 | 0.0202 | 0.0149 | 0.0044 | 73.8432 | Pass |
4 | 7/19/2008 15:22 | 3032.24 | 2502.87 | 2233.3667 | 1326.5200 | 1.5334 | 100.3967 | 0.1235 | 1.5031 | -0.0031 | ... | 10.9698 | 0.4800 | 0.4766 | 0.1045 | 99.3032 | 0.0202 | 0.0149 | 0.0044 | 73.8432 | Pass |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1562 | 10/16/2008 15:13 | 2899.41 | 2464.36 | 2179.7333 | 3085.3781 | 1.4843 | 82.2467 | 0.1248 | 1.3424 | -0.0045 | ... | 11.7256 | 0.4988 | 0.0143 | 0.0039 | 2.8669 | 0.0068 | 0.0138 | 0.0047 | 203.1720 | Pass |
1563 | 10/16/2008 20:49 | 3052.31 | 2522.55 | 2198.5667 | 1124.6595 | 0.8763 | 98.4689 | 0.1205 | 1.4333 | -0.0061 | ... | 17.8379 | 0.4975 | 0.0131 | 0.0036 | 2.6238 | 0.0068 | 0.0138 | 0.0047 | 203.1720 | Pass |
1564 | 10/17/2008 5:26 | 2978.81 | 2379.78 | 2206.3000 | 1110.4967 | 0.8236 | 99.4122 | 0.1208 | NaN | NaN | ... | 17.7267 | 0.4987 | 0.0153 | 0.0041 | 3.0590 | 0.0197 | 0.0086 | 0.0025 | 43.5231 | Pass |
1565 | 10/17/2008 6:01 | 2894.92 | 2532.01 | 2177.0333 | 1183.7287 | 1.5726 | 98.7978 | 0.1213 | 1.4622 | -0.0072 | ... | 19.2104 | 0.5004 | 0.0178 | 0.0038 | 3.5662 | 0.0262 | 0.0245 | 0.0075 | 93.4941 | Pass |
1566 | 10/17/2008 6:07 | 2944.92 | 2450.76 | 2195.4444 | 2914.1792 | 1.5978 | 85.1011 | 0.1235 | NaN | NaN | ... | 22.9183 | 0.4987 | 0.0181 | 0.0040 | 3.6275 | 0.0117 | 0.0162 | 0.0045 | 137.7844 | Pass |
1567 rows × 439 columns
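The preview shows the target stored as the strings `Pass` and `Fail`. Before modelling, it is usually worth checking how the two classes are balanced and, if required, converting them to the -1/1 coding used in the dataset description. The sketch below is one way to do that; the helper names and the four-row toy series are made up for illustration.

```python
import pandas as pd

# Mapping from the string labels to the -1/1 coding in the dataset description
LABEL_CODES = {"Pass": -1, "Fail": 1}


def summarize_labels(labels: pd.Series) -> pd.DataFrame:
    """Count each class and report its share of all rows."""
    counts = labels.value_counts()
    share = labels.value_counts(normalize=True)
    return pd.DataFrame({"count": counts, "share": share})


def encode_labels(labels: pd.Series) -> pd.Series:
    """Map 'Pass'/'Fail' strings to -1/1; unknown values become NaN."""
    return labels.map(LABEL_CODES)


# Toy labels, invented for illustration only
toy_labels = pd.Series(["Pass", "Pass", "Fail", "Pass"], name="Pass/Fail")
summary = summarize_labels(toy_labels)
encoded = encode_labels(toy_labels)
print(summary)
```

On the real data, the calls would be `summarize_labels(dataset["Pass/Fail"])` and `encode_labels(dataset["Pass/Fail"])`.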
Data Exploration¶
A Quick Review of the Data¶
dataset.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1567 entries, 0 to 1566
Columns: 439 entries, Time to Pass/Fail
dtypes: float64(437), object(2)
memory usage: 5.2+ MB
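With this many columns, `info()` only prints a compact summary and does not list per-column non-null counts, even though the dataset is documented as containing missing values. A minimal sketch for tabulating the gaps per column follows; the function name `missing_summary` and the toy frame are assumptions for the example.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """List columns that contain missing values, worst first."""
    n_missing = df.isna().sum()
    summary = pd.DataFrame({"n_missing": n_missing, "fraction": n_missing / len(df)})
    # Keep only columns with at least one gap and sort by the number of gaps
    return summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)


# Toy frame, invented for illustration only
toy = pd.DataFrame({
    "Sensor A": [1.0, np.nan, 3.0, np.nan],
    "Sensor B": [0.1, 0.2, 0.3, 0.4],
    "Sensor C": [np.nan, 1.0, 1.0, 1.0],
})
report = missing_summary(toy)
print(report)
```

On the real data this would be called as `missing_summary(dataset)`.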
Feature Set: Center, Spread, and Range¶
dataset.describe()
Statistic | Sensor 1 | Sensor 2 | Sensor 3 | Sensor 4 | Sensor 5 | Sensor 6 | Sensor 7 | Sensor 8 | Sensor 9 | Sensor 10 | ... | Sensor 428 | Sensor 429 | Sensor 430 | Sensor 431 | Sensor 432 | Sensor 433 | Sensor 434 | Sensor 435 | Sensor 436 | Sensor 437 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 1561.000000 | 1560.000000 | 1553.000000 | 1553.000000 | 1553.000000 | 1553.000000 | 1558.000000 | 1565.000000 | 1565.000000 | 1565.000000 | ... | 1567.000000 | 1567.000000 | 1566.000000 | 1566.000000 | 1566.000000 | 1566.000000 | 1566.000000 | 1566.000000 | 1566.000000 | 1566.000000 |
mean | 3014.452896 | 2495.850231 | 2200.547318 | 1396.376627 | 4.197013 | 101.112908 | 0.121822 | 1.462862 | -0.000841 | 0.000146 | ... | 5.563747 | 16.642363 | 0.500096 | 0.015318 | 0.003847 | 3.067826 | 0.021458 | 0.016475 | 0.005283 | 99.670066 |
std | 73.621787 | 80.407705 | 29.513152 | 441.691640 | 56.355540 | 6.237214 | 0.008961 | 0.073897 | 0.015116 | 0.009302 | ... | 16.921369 | 12.485267 | 0.003404 | 0.017180 | 0.003720 | 3.578033 | 0.012358 | 0.008808 | 0.002867 | 93.891919 |
min | 2743.240000 | 2158.750000 | 2060.660000 | 0.000000 | 0.681500 | 82.131100 | 0.000000 | 1.191000 | -0.053400 | -0.034900 | ... | 0.663600 | 4.582000 | 0.477800 | 0.006000 | 0.001700 | 1.197500 | -0.016900 | 0.003200 | 0.001000 | 0.000000 |
25% | 2966.260000 | 2452.247500 | 2181.044400 | 1081.875800 | 1.017700 | 97.920000 | 0.121100 | 1.411200 | -0.010800 | -0.005600 | ... | 1.408450 | 11.501550 | 0.497900 | 0.011600 | 0.003100 | 2.306500 | 0.013425 | 0.010600 | 0.003300 | 44.368600 |
50% | 3011.490000 | 2499.405000 | 2201.066700 | 1285.214400 | 1.316800 | 101.512200 | 0.122400 | 1.461600 | -0.001300 | 0.000400 | ... | 1.624500 | 13.817900 | 0.500200 | 0.013800 | 0.003600 | 2.757650 | 0.020500 | 0.014800 | 0.004600 | 71.900500 |
75% | 3056.650000 | 2538.822500 | 2218.055500 | 1591.223500 | 1.525700 | 104.586700 | 0.123800 | 1.516900 | 0.008400 | 0.005900 | ... | 1.902000 | 17.080900 | 0.502375 | 0.016500 | 0.004100 | 3.295175 | 0.027600 | 0.020300 | 0.006400 | 114.749700 |
max | 3356.350000 | 2846.440000 | 2315.266700 | 3715.041700 | 1114.536600 | 129.252200 | 0.128600 | 1.656400 | 0.074900 | 0.053000 | ... | 90.423500 | 96.960100 | 0.509800 | 0.476600 | 0.104500 | 99.303200 | 0.102800 | 0.079900 | 0.028600 | 737.304800 |
8 rows × 437 columns
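The summary table is too wide to scan by eye, yet it already hints at columns worth a closer look: Sensor 5, for example, has a maximum of about 1114 against an upper quartile of about 1.53. The sketch below shows one possible way to screen the `describe()` output automatically for constant columns and for columns with an extreme maximum. The function name `screen_columns`, the `iqr_factor` threshold, and the toy frame are assumptions, not part of the original analysis.

```python
import pandas as pd


def screen_columns(df: pd.DataFrame, iqr_factor: float = 10.0) -> pd.DataFrame:
    """Flag constant columns and columns whose maximum lies far above the upper quartile."""
    numeric = df.select_dtypes(include="number")
    stats = numeric.describe().T  # one row per column
    iqr = stats["75%"] - stats["25%"]
    return pd.DataFrame({
        # Zero standard deviation means every observed value is identical
        "constant": stats["std"] == 0,
        # Maximum more than iqr_factor interquartile ranges above the 75th percentile
        "extreme_max": stats["max"] > stats["75%"] + iqr_factor * iqr,
    })


# Toy frame, invented for illustration only
toy = pd.DataFrame({
    "steady": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "spiky": [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1000.0],
    "flat": [5.0] * 8,
})
flags = screen_columns(toy)
print(flags)
```

The factor of 10 is arbitrary; a flagged column is a prompt for inspection, not proof of a bad sensor.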
Plotting the Raw Data¶
import warnings
warnings.filterwarnings('ignore')
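# Suppress warnings raised while plotting (for example, pandas warns when hist() is given an existing Axes)
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    fig, ax = plt.subplots(figsize=(15, 60))
    # The 437 numeric columns fit in a 146 x 3 grid of histograms
    dataset.hist(bins=30, figsize=(10, 10), grid=False, layout=(146, 3), sharex=False, ax=ax, alpha=0.5)
    plt.tight_layout()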
General Correlations¶
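# Pearson correlation between the numeric sensor columns (Time and Pass/Fail are non-numeric)
corr_mat = dataset.corr(method='pearson', numeric_only=True)
# Mask the upper triangle (including the diagonal), which mirrors the lower one
mask = np.triu(np.ones_like(corr_mat, dtype=bool))
fig, ax = plt.subplots(figsize=(20, 15))
ax.set_title("Pearson's R Correlation Matrix")
sns.heatmap(corr_mat, mask=mask, annot=False, linewidths=0, linecolor='white', cmap='YlGnBu', ax=ax)
plt.show()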
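A heatmap over several hundred sensors is hard to read cell by cell, so it can help to list the most strongly correlated pairs explicitly. The sketch below is one way to do that with pandas and NumPy; the function name `top_correlated_pairs` and the toy frame are assumptions made for this example.

```python
import numpy as np
import pandas as pd


def top_correlated_pairs(df: pd.DataFrame, n: int = 10) -> pd.Series:
    """Return the n column pairs with the largest absolute Pearson correlation."""
    corr = df.corr(method="pearson", numeric_only=True)
    # Keep the strict upper triangle so each pair appears once and the diagonal is dropped
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    # Sort by absolute value while keeping the sign of each correlation
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)


# Toy frame, invented for illustration only
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + 1,                          # exact linear function of a
    "c": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],
    "label": ["Pass", "Fail", "Pass", "Pass", "Fail", "Pass"],  # non-numeric, ignored
})
top = top_correlated_pairs(toy)
print(top)
```

Pairs with a correlation close to 1 or -1 carry largely redundant information, which makes them natural candidates for pruning during feature selection.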