Type Parameters:
L - The type of the labels in the Dataset
F - The type of the features in the Dataset

public class RVFDataset<L,F> extends GeneralDataset<L,F>

An interfacing class for ClassifierFactory that incrementally builds a more memory-efficient representation of a List of RVFDatum objects for the purposes of training a Classifier with a ClassifierFactory.

Fields inherited from class GeneralDataset: data, featureIndex, labelIndex, labels, size

| Constructor and Description |
|---|
| RVFDataset() |
| RVFDataset(Index<F> featureIndex, Index<L> labelIndex) |
| RVFDataset(Index<L> labelIndex, int[] labels, Index<F> featureIndex, int[][] data, double[][] values): Constructor that fully specifies a Dataset. |
| RVFDataset(int numDatums) |
| RVFDataset(int numDatums, Index<F> featureIndex, Index<L> labelIndex) |
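
The fully specified constructor takes a label index, an int[] of labels, a feature index, an int[][] data array, and a double[][] values array. The sketch below shows one plausible reading of that layout, in which data[i] holds the feature ids of datum i and values[i] holds the matching real values. It is a standalone illustration, not the RVFDataset implementation: the class SparseRvfSketch, its methods, and the use of plain maps in place of Index are all assumptions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an index-backed sparse dataset (not CoreNLP code).
final class SparseRvfSketch {
    // String -> int id maps play the role of the feature and label indexes.
    private final Map<String, Integer> featureIndex = new LinkedHashMap<>();
    private final Map<String, Integer> labelIndex = new LinkedHashMap<>();
    private final List<Integer> labels = new ArrayList<>();   // label id per datum
    private final List<int[]> data = new ArrayList<>();       // feature ids per datum
    private final List<double[]> values = new ArrayList<>();  // feature values per datum

    // Returns the id of key, assigning the next free id on first sight.
    private static int indexOf(Map<String, Integer> index, String key) {
        Integer id = index.get(key);
        if (id == null) {
            id = index.size();
            index.put(key, id);
        }
        return id;
    }

    // Stores one datum as two parallel arrays instead of a map of objects.
    void add(String label, Map<String, Double> features) {
        int[] ids = new int[features.size()];
        double[] vals = new double[features.size()];
        int i = 0;
        for (Map.Entry<String, Double> e : features.entrySet()) {
            ids[i] = indexOf(featureIndex, e.getKey());
            vals[i] = e.getValue();
            i++;
        }
        labels.add(indexOf(labelIndex, label));
        data.add(ids);
        values.add(vals);
    }

    int size() { return labels.size(); }
    int numFeatures() { return featureIndex.size(); }
    int label(int i) { return labels.get(i); }
    int[] featureIds(int i) { return data.get(i).clone(); }
    double[] featureValues(int i) { return values.get(i).clone(); }
}
```

Storing ids and values in primitive arrays avoids one object per feature occurrence, which is the kind of memory saving the class description refers to.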
| Modifier and Type | Method and Description |
|---|---|
| void | add(Datum<L,F> d) |
| void | add(Datum<L,F> d, java.lang.String src, java.lang.String id) |
| void | addAll(java.lang.Iterable<? extends Datum<L,F>> data): Adds all Datums in the given collection of data to this dataset. |
| void | addAllWithSourcesAndIds(RVFDataset<L,F> data) |
| void | applyFeatureCountThreshold(int k): Applies a feature count threshold to the RVFDataset. |
| void | applyFeatureMaxCountThreshold(int k): Applies a feature max count threshold to the RVFDataset. |
| void | clear(): Resets the Dataset so that it is empty and ready to collect data. |
| void | clear(int numDatums): Resets the Dataset so that it is empty and ready to collect data. |
| void | ensureRealValues(): Checks if the dataset has any unbounded values. |
| RVFDatum<L,F> | getDatum(int index) |
| RVFDatum<L,F> | getRVFDatum(int index) |
| java.lang.String | getRVFDatumId(int index) |
| java.lang.String | getRVFDatumSource(int index) |
| RVFDatum<L,F> | getRVFDatumWithId(int index) |
| double[][] | getValuesArray() |
| protected void | initialize(int numDatums): Resets the values of the dataset such that it is empty with an initial capacity of numDatums. |
| java.util.Iterator<RVFDatum<L,F>> | iterator() |
| static void | main(java.lang.String[] args) |
| void | printFullFeatureMatrix(java.io.PrintWriter pw): Prints the full feature matrix in tab-delimited form. |
| void | printFullFeatureMatrixWithValues(java.io.PrintWriter pw): Modification of printFullFeatureMatrix to correct bugs and print values (Rajat). |
| void | printSparseFeatureMatrix(): Prints the sparse feature matrix using printSparseFeatureMatrix(PrintWriter) to System.out. |
| void | printSparseFeatureMatrix(java.io.PrintWriter pw): Prints a sparse feature matrix representation of the Dataset. |
| void | printSparseFeatureValues(int datumNo, java.io.PrintWriter pw): Prints a sparse feature-value output of the Dataset. |
| void | printSparseFeatureValues(java.io.PrintWriter pw): Prints a sparse feature-value output of the Dataset. |
| void | randomize(long randomSeed): Randomizes the data array in place. |
| void | readSVMLightFormat(java.io.File file): Reads SVM-light formatted data into this dataset. |
| static RVFDataset<java.lang.String,java.lang.String> | readSVMLightFormat(java.lang.String filename): Constructs a Dataset by reading in a file in SVM-light format. |
| static RVFDataset<java.lang.String,java.lang.String> | readSVMLightFormat(java.lang.String filename, Index<java.lang.String> featureIndex, Index<java.lang.String> labelIndex): Constructs a Dataset by reading in a file in SVM-light format. |
| static RVFDataset<java.lang.String,java.lang.String> | readSVMLightFormat(java.lang.String filename, java.util.List<java.lang.String> lines): Constructs a Dataset by reading in a file in SVM-light format. |
| RVFDataset<L,F> | scaleDataset(RVFDataset<L,F> dataset): Scales the values of each feature in each datum linearly using the min and max values found in the training set. |
| RVFDataset<L,F> | scaleDatasetGaussian(RVFDataset<L,F> dataset) |
| RVFDatum<L,F> | scaleDatum(RVFDatum<L,F> datum): Scales the values of each feature linearly using the min and max values found in the training set. |
| RVFDatum<L,F> | scaleDatumGaussian(RVFDatum<L,F> datum) |
| void | scaleFeatures(): Scales feature values linearly such that each feature value lies between 0 and 1. |
| void | scaleFeaturesGaussian() |
| void | selectFeaturesFromSet(java.util.Set<F> featureSet): Removes all features from the dataset that are not in featureSet. |
| <E> void | shuffleWithSideInformation(long randomSeed, java.util.List<E> sideInformation): Randomizes the data array in place. |
| Pair<GeneralDataset<L,F>,GeneralDataset<L,F>> | split(double percentDev): Divides out a (devtest) split from the start of the dataset and the rest of it (as a training set). |
| Pair<GeneralDataset<L,F>,GeneralDataset<L,F>> | split(int start, int end): Divides out a (devtest) split of the dataset versus the rest of it (as a training set). |
| void | summaryStatistics(): Prints some summary statistics to stderr for the Dataset. |
| static RVFDatum<java.lang.String,java.lang.String> | svmLightLineToRVFDatum(java.lang.String l) |
| java.lang.String | toString() |
| java.lang.String | toSummaryString() |
| void | writeSVMLightFormat(java.io.File file): Writes the dataset in SVM-light format to the file. |
| void | writeSVMLightFormat(java.io.PrintWriter writer) |

Methods inherited from class GeneralDataset: featureIndex, getDataArray, getFeatureCounts, getLabelsArray, labelIndex, labelIterator, makeSvmLabelMap, mapDataset, mapDataset, mapDatum, numClasses, numDatumsPerLabel, numFeatures, numFeatureTokens, numFeatureTypes, printSVMLightFormat, printSVMLightFormat, retainFeatures, sampleDataset, size, splitOutFold, trimData, trimLabels, trimToSize, trimToSize, trimToSize

public Pair<GeneralDataset<L,F>,GeneralDataset<L,F>> split(double percentDev)
split in class GeneralDataset<L,F>
Parameters: percentDev - The first fractionSplit of datums (rounded down) will be the second split

public void scaleFeaturesGaussian()
public void scaleFeatures()
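
The summary describes scaleFeatures() as scaling feature values linearly so that each lies between 0 and 1. A minimal sketch of that min-max technique follows. It works on dense rows for brevity, whereas RVFDataset stores sparse data, and the class name MinMaxScalingSketch and the treatment of constant features are assumptions rather than the library's behavior.

```java
// Illustrative min-max scaling on dense rows (hypothetical helper, not CoreNLP code).
final class MinMaxScalingSketch {

    /** Rescales every column of rows in place so its values fall in [0, 1]. */
    static void scaleFeatures(double[][] rows) {
        if (rows.length == 0) {
            return;
        }
        int numFeatures = rows[0].length;
        for (int f = 0; f < numFeatures; f++) {
            // First pass: find the observed range of feature f.
            double min = Double.POSITIVE_INFINITY;
            double max = Double.NEGATIVE_INFINITY;
            for (double[] row : rows) {
                min = Math.min(min, row[f]);
                max = Math.max(max, row[f]);
            }
            double range = max - min;
            // Second pass: map min to 0 and max to 1.
            for (double[] row : rows) {
                // Assumption: a constant feature maps to 0 instead of dividing by zero.
                row[f] = range == 0.0 ? 0.0 : (row[f] - min) / range;
            }
        }
    }
}
```

For example, a feature column with values 1, 3, 5 becomes 0, 0.5, 1.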
public void ensureRealValues()
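
The summary says ensureRealValues() checks whether the dataset has any unbounded values. This page does not state what happens when one is found, so the sketch below assumes the check rejects NaN and infinite values by throwing. The class RealValueCheckSketch and the exception type are hypothetical.

```java
// Illustrative finite-value check (hypothetical helper, not CoreNLP code).
final class RealValueCheckSketch {

    /** Returns normally if every value is finite; otherwise throws. */
    static void ensureRealValues(double[][] values) {
        for (int i = 0; i < values.length; i++) {
            for (int j = 0; j < values[i].length; j++) {
                double v = values[i][j];
                // Assumption: both NaN and +/-Infinity count as "unbounded".
                if (Double.isNaN(v) || Double.isInfinite(v)) {
                    throw new IllegalStateException(
                        "datum " + i + " has non-finite value " + v + " at position " + j);
                }
            }
        }
    }
}
```

Running such a check before scaling or training helps surface bad feature extraction early, since a single NaN can propagate through every later computation.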
public RVFDataset<L,F> scaleDataset(RVFDataset<L,F> dataset)
Parameters: dataset

public RVFDatum<L,F> scaleDatum(RVFDatum<L,F> datum)
Parameters: datum

public RVFDataset<L,F> scaleDatasetGaussian(RVFDataset<L,F> dataset)
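
This page gives no description for the Gaussian scaling methods. The name suggests z-score standardization with statistics taken from the training set, in the same way the linear variants reuse the training set's min and max. The sketch below illustrates that reading only. The class GaussianScalingSketch, the use of the population standard deviation, and the handling of zero variance are all assumptions.

```java
// Illustrative z-score scaling with training-set statistics (not CoreNLP code).
final class GaussianScalingSketch {
    private final double[] means;
    private final double[] stdDevs;

    /** Estimates the per-feature mean and population standard deviation. */
    GaussianScalingSketch(double[][] train) {
        int n = train.length;
        int numFeatures = n == 0 ? 0 : train[0].length;
        means = new double[numFeatures];
        stdDevs = new double[numFeatures];
        for (int f = 0; f < numFeatures; f++) {
            double sum = 0.0;
            for (double[] row : train) {
                sum += row[f];
            }
            means[f] = sum / n;
            double squares = 0.0;
            for (double[] row : train) {
                double d = row[f] - means[f];
                squares += d * d;
            }
            stdDevs[f] = Math.sqrt(squares / n);
        }
    }

    /** Returns a scaled copy of row; the input is left unchanged. */
    double[] scaleDatum(double[] row) {
        double[] out = new double[row.length];
        for (int f = 0; f < row.length; f++) {
            // Assumption: zero-variance features map to 0.
            out[f] = stdDevs[f] == 0.0 ? 0.0 : (row[f] - means[f]) / stdDevs[f];
        }
        return out;
    }
}
```

Keeping the statistics in the scaler object means a test datum is transformed with the training distribution, which avoids leaking test-set information into preprocessing.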
public Pair<GeneralDataset<L,F>,GeneralDataset<L,F>> split(int start, int end)
split in class GeneralDataset<L,F>
Parameters:
start - Begin devtest with this index (inclusive)
end - End devtest before this index (exclusive)

public RVFDatum<L,F> getDatum(int index)
getDatum in class GeneralDataset<L,F>

public RVFDatum<L,F> getRVFDatum(int index)
getRVFDatum in class GeneralDataset<L,F>

public java.lang.String getRVFDatumSource(int index)
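
The index-based split(int start, int end) above carves out the half-open range [start, end) as the devtest portion and leaves the rest as training data. The sketch below shows that partitioning on a plain list. The class SplitSketch is hypothetical, and returning the training part first and the devtest part second is an assumption, chosen because the percentDev parameter note calls the devtest portion "the second split".

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative devtest/train partition by index range (not CoreNLP code).
final class SplitSketch {

    /** Returns [train, devtest], where devtest holds items in [start, end). */
    static <T> List<List<T>> split(List<T> items, int start, int end) {
        // Devtest: indices start (inclusive) to end (exclusive).
        List<T> devTest = new ArrayList<>(items.subList(start, end));
        // Train: everything before start plus everything from end onward.
        List<T> train = new ArrayList<>(items.subList(0, start));
        train.addAll(items.subList(end, items.size()));

        List<List<T>> result = new ArrayList<>();
        result.add(train);   // assumption: training portion comes first
        result.add(devTest);
        return result;
    }
}
```

With five items 0..4, split(items, 1, 3) yields devtest [1, 2] and train [0, 3, 4].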
public java.lang.String getRVFDatumId(int index)
public void addAllWithSourcesAndIds(RVFDataset<L,F> data)
public void addAll(java.lang.Iterable<? extends Datum<L,F>> data)
addAll in class GeneralDataset<L,F>
Parameters: data - collection of datums you would like to add to the dataset

public void clear()
clear in class GeneralDataset<L,F>

public void clear(int numDatums)
clear in class GeneralDataset<L,F>
Parameters: numDatums - initial capacity of dataset

protected void initialize(int numDatums)
initialize in class GeneralDataset<L,F>
Parameters: numDatums - initial capacity of dataset

public void summaryStatistics()
summaryStatistics in class GeneralDataset<L,F>

public void printFullFeatureMatrix(java.io.PrintWriter pw)
public void printFullFeatureMatrixWithValues(java.io.PrintWriter pw)
public static RVFDataset<java.lang.String,java.lang.String> readSVMLightFormat(java.lang.String filename)
public static RVFDataset<java.lang.String,java.lang.String> readSVMLightFormat(java.lang.String filename, java.util.List<java.lang.String> lines)
public static RVFDataset<java.lang.String,java.lang.String> readSVMLightFormat(java.lang.String filename, Index<java.lang.String> featureIndex, Index<java.lang.String> labelIndex)
public void selectFeaturesFromSet(java.util.Set<F> featureSet)
Parameters: featureSet

public void applyFeatureCountThreshold(int k)
applyFeatureCountThreshold in class GeneralDataset<L,F>

public void applyFeatureMaxCountThreshold(int k)
applyFeatureMaxCountThreshold in class GeneralDataset<L,F>

public static RVFDatum<java.lang.String,java.lang.String> svmLightLineToRVFDatum(java.lang.String l)
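
svmLightLineToRVFDatum turns one SVM-light line into a datum with String labels and String features. An SVM-light line has the shape "label index:value index:value ... # comment". The sketch below parses that shape into a label and a feature map. It is not the library's parser: the class SvmLightLineSketch is hypothetical, and special tokens such as qid are not handled.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative SVM-light line parser (hypothetical helper, not CoreNLP code).
final class SvmLightLineSketch {
    final String label;
    final Map<String, Double> features;

    SvmLightLineSketch(String label, Map<String, Double> features) {
        this.label = label;
        this.features = features;
    }

    /** Parses "label id:value id:value ... # comment" into label and features. */
    static SvmLightLineSketch parse(String line) {
        // Everything after '#' is a comment in SVM-light files.
        int hash = line.indexOf('#');
        String body = (hash >= 0 ? line.substring(0, hash) : line).trim();
        String[] tokens = body.split("\\s+");

        Map<String, Double> features = new LinkedHashMap<>();
        for (int i = 1; i < tokens.length; i++) {
            int colon = tokens[i].indexOf(':');
            if (colon < 0) {
                throw new IllegalArgumentException("Bad feature token: " + tokens[i]);
            }
            String name = tokens[i].substring(0, colon);
            double value = Double.parseDouble(tokens[i].substring(colon + 1));
            features.put(name, value);
        }
        // The first token is the label; it is kept as a String, like the features.
        return new SvmLightLineSketch(tokens[0], features);
    }
}
```

Keeping the feature ids as strings mirrors the RVFDatum<String,String> return type shown in the signature.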
public void readSVMLightFormat(java.io.File file)
Parameters: file - The file from which the data should be read.

public void writeSVMLightFormat(java.io.File file) throws java.io.FileNotFoundException
Writes the dataset in SVM-light format to the file. See also readSVMLightFormat(File).
Parameters: file - The location where the dataset should be written.
Throws: java.io.FileNotFoundException

public void writeSVMLightFormat(java.io.PrintWriter writer)
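
The writeSVMLightFormat methods emit the dataset in SVM-light format. As a rough illustration of what one output line looks like, the sketch below formats an integer label and parallel id/value arrays as "label id:value ...". The class SvmLightWriterSketch is hypothetical, and both the integer encoding and the ascending id order are assumptions based on the SVM-light file format, not on this page.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative SVM-light line formatter (hypothetical helper, not CoreNLP code).
final class SvmLightWriterSketch {

    /** Formats one datum as "label id:value id:value ..." with ids ascending. */
    static String formatLine(int label, int[] featureIds, double[] featureValues) {
        if (featureIds.length != featureValues.length) {
            throw new IllegalArgumentException("ids and values must have equal length");
        }
        // TreeMap orders feature ids ascending, which SVM-light expects.
        Map<Integer, Double> sorted = new TreeMap<>();
        for (int i = 0; i < featureIds.length; i++) {
            sorted.put(featureIds[i], featureValues[i]);
        }
        StringBuilder sb = new StringBuilder();
        sb.append(label);
        for (Map.Entry<Integer, Double> e : sorted.entrySet()) {
            sb.append(' ').append(e.getKey()).append(':').append(e.getValue());
        }
        return sb.toString();
    }
}
```

For example, formatLine(1, new int[] {7, 3}, new double[] {2.0, 0.5}) returns "1 3:0.5 7:2.0".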
public void printSparseFeatureMatrix()
Prints the sparse feature matrix using printSparseFeatureMatrix(PrintWriter) to System.out.
printSparseFeatureMatrix in class GeneralDataset<L,F>

public void printSparseFeatureMatrix(java.io.PrintWriter pw)
Prints a sparse feature matrix representation of the Dataset, using the Object.toString() representations of features.
printSparseFeatureMatrix in class GeneralDataset<L,F>

public void printSparseFeatureValues(java.io.PrintWriter pw)
Prints a sparse feature-value output of the Dataset, using the Object.toString() representations of features. This is probably what you want for RVFDataset, since the two printSparseFeatureMatrix methods above seem useless and unused.

public void printSparseFeatureValues(int datumNo, java.io.PrintWriter pw)
Prints a sparse feature-value output of the Dataset, using the Object.toString() representations of features. This is probably what you want for RVFDataset, since the two printSparseFeatureMatrix methods above seem useless and unused.

public static void main(java.lang.String[] args)
public double[][] getValuesArray()
getValuesArray in class GeneralDataset<L,F>

public java.lang.String toString()
toString in class java.lang.Object

public java.lang.String toSummaryString()
public void randomize(long randomSeed)
randomize in class GeneralDataset<L,F>
Parameters: randomSeed - A seed for the Random object (allows you to reproduce the same ordering)

public <E> void shuffleWithSideInformation(long randomSeed, java.util.List<E> sideInformation)
shuffleWithSideInformation in class GeneralDataset<L,F>
Parameters: randomSeed - A seed for the Random object (allows you to reproduce the same ordering)
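
shuffleWithSideInformation randomizes the data in place from a seed and takes a parallel list of side information. A natural reading is that both are permuted identically so that entry i of the side list still describes datum i afterwards. The sketch below shows that idea with a seeded Fisher-Yates shuffle. The class ShuffleSketch, the extra data parameter, and the choice of Fisher-Yates are assumptions, not the library's code.

```java
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Illustrative seeded shuffle of two parallel lists (not CoreNLP code).
final class ShuffleSketch {

    /** Applies the same seeded permutation to data and sideInformation in place. */
    static <D, E> void shuffleWithSideInformation(
            long randomSeed, List<D> data, List<E> sideInformation) {
        if (data.size() != sideInformation.size()) {
            throw new IllegalArgumentException("lists must have equal length");
        }
        // The same seed always yields the same sequence of swaps.
        Random rand = new Random(randomSeed);
        // Fisher-Yates: walk from the end, swapping each slot with a random earlier one.
        for (int j = data.size() - 1; j > 0; j--) {
            int k = rand.nextInt(j + 1);
            Collections.swap(data, j, k);
            Collections.swap(sideInformation, j, k);
        }
    }
}
```

Because every swap is applied to both lists, the pairing between a datum and its side information survives the shuffle, and rerunning with the same seed reproduces the same ordering.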